Abstract

We consider the problem of on-line prediction of real-valued labels, 
assumed bounded in absolute value by a known constant, of new objects 
from known labeled objects. The prediction algorithm's performance is 
measured by the squared deviation of the predictions from the actual 
labels. No stochastic assumptions are made about the way the labels 
and objects are generated. Instead, we are given a benchmark class of 
prediction rules some of which are hoped to produce good predictions. 
We show that for a wide range of infinite-dimensional benchmark classes 
one can construct a prediction algorithm whose cumulative loss over the 
first N examples does not exceed the cumulative loss of any prediction 
rule in the class plus $O(\sqrt{N})$; the main differences from the known results
are that we do not impose any upper bound on the norm of the considered 
prediction rules and that we achieve an optimal leading term in the excess 
loss of our algorithm. If the benchmark class is "universal" (dense in the 
class of continuous functions on each compact set), this provides an on- 
line non-stochastic analogue of universally consistent prediction in non- 
parametric statistics. We use two proof techniques; one is based on the 
Aggregating Algorithm and the other on the recently developed method 
of defensive forecasting. 

Remark The main difference of the current (second) version of this tech- 
nical report from the previous version is a fuller discussion of the related 
literature. The latter is, however, massive, and it is likely that even in 
this version some important related results are missing. 



1 Introduction 

The traditional, and still dominant, approach to the problem of regression is 
statistical: the objects and their real-valued labels are assumed to be generated 
independently from the same probability distribution, and a typical goal is to 
find a prediction rule with a small expected loss. A newer approach is "compet- 
itive on-line regression" , in which the goal is to perform almost as well as the 
best rules in a given benchmark class of prediction rules. (See, e.g., PH], §1, or
[54], §4, for reviews of some relevant literature.) Unlike the statistical theory of
regression, no stochastic assumptions are made about the data. 

A great impetus for the development of the statistical theories of regression 
and pattern recognition (see, e.g., |2H1 and, especially, jEij Preface and Chapter 
1) has been Stone's 1977 result that there exists a "universally consistent" 
prediction algorithm: an algorithm that asymptotically achieves, with probabil- 
ity one (or high probability), the best possible expected loss. The property of 
universal consistency is very attractive, but it is asymptotic and does not tell 
us anything about finite data sequences. Stone's result provided a direction in 
which more practicable results have been sought. 

Surprisingly, it appears that universal consistency has not even been defined
in competitive on-line learning theory. We propose such a definition in §2; in
§5 we will see how close earlier papers, such as those of Cesa-Bianchi et al.
and Auer et al., came to constructing universally
consistent algorithms. However, our Corollary 1 in §2 appears to be the first
explicit statement about the existence of the latter.

As in the case of statistical regression, universal consistency is only a minimal 
requirement; one also wants good rates of convergence, ideally not involving 
unknown constants, for universal benchmark classes. The notion of universality 
is discussed, formally and informally, at the end of §2, and in ^ we will argue
that universality for benchmark classes is a matter of degree. Our main results,
Theorems 1-3, are stated in §2 and proved in later sections. They describe properties of
universality of our prediction algorithms, some of which are described explicitly
in the last section, §10. In §3 Theorem 1 is applied to the case where the objects
and their labels are drawn independently from the same distribution. In §4 we
consider some interesting benchmark classes of prediction rules, and in §5 we
compare our results to some related ones in the literature.

In this paper we use two very different proof techniques: the old one in- 
troduced in [SSJ 1^ and the one developed in ; we are especially interested 
in the latter since it appears much more versatile, and competitive on-line re- 
gression is a good testing ground to develop it. This technique has its origin 
in Foster and Vohra's paper which demonstrated the existence of a ran- 
domized forecasting strategy that produces asymptotically well-calibrated fore- 
casts with probability one. Foster and Vohra's result was translated into the 
game-theoretic foundations of probability (see, e.g., 02]) in HHI. In June 2004
Akimichi Takemura further developed the method of [HHI showing that for any 
continuous game-theoretic law of probability there exists a forecasting strat- 
egy that perfectly satisfies this law of probability; such a strategy was called a 
"defensive forecasting strategy" in [SUj. An important special case of defensive 
forecasting is where the law of probability asserts good calibration and resolution 
of the forecasts; it was explored in |56j . where, in particular, a non-asymptotic 
version of Foster and Vohra's result was proved. In [SHI it was shown that the 
corresponding forecasting strategies lead to a small cumulative loss in a fairly 
wide class of decision protocols. That paper only dealt with the case of binary 
classification, and in this paper similar results are proved for on-line regression. 
As the loss function we use square-loss, which leads to significant simplifications
as compared with [SS]. (Despite being the source of our approach, our proof
technique appears to have lost all connections with that paper and papers, such
as 123 mill Eg, further developing it.)

Our results are closely related to those of Cesa-Bianchi et al. and Auer
et al., but we postpone a detailed discussion to §5.

2 Main results 

The simple perfect-information protocol of this section is: 

FOR n = 1, 2, ...:

  Reality announces $x_n \in \mathbf{X}$.

  Predictor announces $\mu_n \in \mathbb{R}$.

  Reality announces $y_n \in [-Y, Y]$.
END FOR.

At the beginning of each round n Predictor is shown an object $x_n$ whose label $y_n$
is to be predicted. The set of a priori possible objects is called the object space
and denoted $\mathbf{X}$; of course, we always assume $\mathbf{X} \ne \emptyset$. After Predictor announces
his prediction $\mu_n$ for the object's label he is shown the actual label $y_n \in \mathbb{R}$. We
assume known an a priori upper bound $Y \in (0, \infty)$ on the absolute values of
the labels $y_n$. We will sometimes refer to pairs $(x_n, y_n)$ as examples. By an
on-line prediction algorithm we mean a strategy for Predictor in this protocol; 
in this paper, however, we are not concerned with computational complexity of 
our prediction algorithms. 
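
The protocol above can be sketched in a few lines of Python. This is only an illustration: the function names and the "running mean" Predictor are hypothetical placeholders, not the algorithms of this paper, and Reality is replaced by a fixed list of examples.

```python
def play_protocol(predict, examples, Y):
    """Run the perfect-information protocol on a fixed sequence of examples.

    predict(history, x) is Predictor's strategy: it sees the past examples
    and the current object x, and returns a prediction mu.
    Returns Predictor's cumulative squared loss.
    """
    history = []
    cumulative_loss = 0.0
    for x, y in examples:               # Reality announces x_n (y_n stays hidden)
        assert abs(y) <= Y              # labels are bounded by the known constant Y
        mu = predict(history, x)        # Predictor announces mu_n
        cumulative_loss += (y - mu) ** 2  # Reality reveals y_n; square loss
        history.append((x, y))
    return cumulative_loss


def running_mean_predictor(history, x):
    """Hypothetical baseline: predict the mean of the labels seen so far."""
    if not history:
        return 0.0
    return sum(y for _, y in history) / len(history)


examples = [(0.1, 0.5), (0.4, -0.5), (0.9, 1.0)]
loss = play_protocol(running_mean_predictor, examples, Y=1.0)
print(loss)
```

Any strategy with the same `predict(history, x)` signature can be plugged in, which is all that "on-line prediction algorithm" means in this protocol.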

Predictor's loss on round n is measured by $(y_n - \mu_n)^2$, and so his cumulative
loss after N rounds of the game is $\sum_{n=1}^N (y_n - \mu_n)^2$. His goal is "universal
prediction", in the following, rather vague, sense. If $D : \mathbf{X} \to \mathbb{R}$ is a "prediction
rule" (i.e., the function D is interpreted as a rule for choosing the prediction
based on the current object), he would like to have

$$\sum_{n=1}^N (y_n - \mu_n)^2 \lesssim \sum_{n=1}^N (y_n - D(x_n))^2 \qquad (1)$$

($\lesssim$ meaning "not much greater than") provided D is not "too complex". Technically,
we will be interested in the case where the prediction rule D is assumed
to belong to a large reproducing kernel Hilbert space (to be defined shortly) and
the complexity of D is measured by its norm.

As already mentioned, the results of this section are closely related to several
results of Cesa-Bianchi et al. and Auer et al.; see §5.

Reproducing kernel Hilbert spaces 

A reproducing kernel Hilbert space (RKHS) on a set Z (such as $Z = \mathbf{X}$) is a
Hilbert space $\mathcal{F}$ of real-valued functions on Z such that the evaluation functional
$f \in \mathcal{F} \mapsto f(z)$ is continuous for each $z \in Z$. We will use the notation $c_{\mathcal{F}}(z)$ for
the norm of this functional:

$$c_{\mathcal{F}}(z) := \sup_{f : \|f\|_{\mathcal{F}} \le 1} |f(z)|.$$

Let

$$c_{\mathcal{F}} := \sup_{z \in Z} c_{\mathcal{F}}(z); \qquad (2)$$

we will be interested in the case $c_{\mathcal{F}} < \infty$.
Examples of RKHS will be given in §4.

Main theorems 

Suppose Predictor's goal is to compete with prediction rules D from an RKHS
$\mathcal{F}$ on $\mathbf{X}$. The three theorems that we state in this subsection bound the difference
between the left-hand and right-hand sides of (1); this bound will be called the
regret term. The simplest regret term, given in the first theorem, is in terms of
$c_{\mathcal{F}}$, $\|D\|_{\mathcal{F}}$, and N.

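Theorem 1 Let $\mathcal{F}$ be an RKHS on $\mathbf{X}$. There exists an on-line prediction
algorithm producing $\mu_n \in [-Y, Y]$ that are guaranteed to satisfy

$$\sum_{n=1}^N (y_n - \mu_n)^2 \le \sum_{n=1}^N (y_n - D(x_n))^2 + 2Y \sqrt{c_{\mathcal{F}}^2 + 1}\, \left(\|D\|_{\mathcal{F}} + Y\right) \sqrt{N} \qquad (3)$$

for all $N = 1, 2, \ldots$ and all $D \in \mathcal{F}$.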
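
To get a feel for the size of the regret term in Theorem 1, the following sketch evaluates it numerically, assuming the bound reads $2Y\sqrt{c_{\mathcal{F}}^2+1}\,(\|D\|_{\mathcal{F}}+Y)\sqrt{N}$. The function and parameter names are illustrative only.

```python
import math


def theorem1_regret(c_F, norm_D, Y, N):
    """Regret term 2*Y*sqrt(c_F^2 + 1)*(||D|| + Y)*sqrt(N), as assumed above."""
    return 2 * Y * math.sqrt(c_F ** 2 + 1) * (norm_D + Y) * math.sqrt(N)


def leading_term(c_F, norm_D, Y, N):
    """The leading part 2*Y*c_F*||D||*sqrt(N), relevant when ||D|| >> Y and c_F >> 1."""
    return 2 * Y * c_F * norm_D * math.sqrt(N)


# The regret grows like sqrt(N), so the regret per example vanishes as N grows.
for N in (100, 10_000, 1_000_000):
    total = theorem1_regret(c_F=1.15, norm_D=5.0, Y=1.0, N=N)
    print(N, round(total, 2), round(total / N, 5))
```

The per-example column shrinks by a factor of 10 each time N grows by a factor of 100, which is what drives the universal consistency results later in this section.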

The regret term in the second theorem is in terms of $c_{\mathcal{F}}$, $\|D\|_{\mathcal{F}}$, and the
cumulative loss of D (which can be significantly less than N).

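Theorem 2 Let $\mathcal{F}$ be an RKHS on $\mathbf{X}$. There exists an on-line prediction
algorithm producing $\mu_n \in [-Y, Y]$ that are guaranteed to satisfy

$$\sum_{n=1}^N (y_n - \mu_n)^2 \le \sum_{n=1}^N (y_n - D(x_n))^2 + 2\sqrt{c_{\mathcal{F}}^2 + 1}\, \left(\|D\|_{\mathcal{F}} + Y\right) \sqrt{\sum_{n=1}^N (y_n - D(x_n))^2 + (c_{\mathcal{F}}^2 + 1) \left(\|D\|_{\mathcal{F}} + Y\right)^2} + 2 (c_{\mathcal{F}}^2 + 1) \left(\|D\|_{\mathcal{F}} + Y\right)^2 \qquad (4)$$

for all N and all $D \in \mathcal{F}$.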

The regret term of Theorem 2 is close to being stronger than that of Theorem 1:
the former is at most twice as large as the latter plus an additive constant, if
we restrict our attention to the prediction rules D such that $\|D\|_{\mathcal{F}}$ is bounded
by a constant and $|D(x)| \le Y$, $\forall x \in \mathbf{X}$.
On-line prediction algorithms achieving (3) and (4) will be stated explicitly in
§10. They are based on the idea of defensive forecasting. However, the regression
problem considered in this paper is very well studied, and one can hardly hope
to beat the known techniques. The next theorem gives an upper bound on the
regret term achievable by using the procedure ("Aggregating Algorithm", or
AA) described in and applied to the problem of regression in [SI] and [^H].
A popular alternative technique based on the gradient descent method could
also be used, but it tends to lead to worse leading constants: see §5 for details.

Theorem 3 Let $\mathcal{F}$ be a separable RKHS on $\mathbf{X}$. There exists an on-line prediction
algorithm producing $\mu_n \in [-Y, Y]$ that are guaranteed to satisfy

N N 



n=l 



^2 J \ 2 ' ' 2^2 

JV 

< ^{y„-D{x„))^ + 2Ymax(cjr\\D\\jr,YSN-^^^+^'^ VFT2 

n=l 

+ ^Y^lnN+^^^^p^ + OiY^) (5) 

for all $N = 1, 2, \ldots$ and all $D \in \mathcal{F}$, where $\delta > 0$ is an arbitrarily small constant,
$\Gamma$ is the gamma function (Chapter 6), and U is Kummer's U function
(Chapter 13). The constant implicit in $O(Y^2)$ depends only on $\delta$.

The bound of Theorem 3 is even closer to being stronger than that of Theorem 1
as $N \to \infty$: the leading constant is the same, $2Y c_{\mathcal{F}} \|D\|_{\mathcal{F}}$ (assuming
$\|D\|_{\mathcal{F}} \gg Y$ and $c_{\mathcal{F}} \gg 1$), but the other terms are considerably better. The
main disadvantage of the bound (5) is the asymptotic character of (namely, the
presence of the O term in) its more explicit version. The version involving the
gamma and Kummer's U functions is not intuitive, but it can be evaluated using 
standard libraries; the function 

/(iv,<i):.-i.(r(f + i)t/(^ + i,o4 

is plotted in Figure 1.

The condition of separability in Theorem 3 does not appear restrictive; in
particular, it is satisfied for all examples considered in §4.

Finally, we give a lower bound (a version of Theorem VII.2 in [15]) showing
that the leading constant $2Y c_{\mathcal{F}} \|D\|_{\mathcal{F}}$ is optimal.

Figure 1: The graph of the function f(N, d) for N = 1, . . . , 100 and d ∈ [1, 3].
The two final values at the corners are f(100, 1) ≈ 12.37 and f(100, 3) ≈ 30.15.

Theorem 4 Suppose the object space is $\mathbf{X} = \mathbb{R}$. For any positive constant
c there exists an RKHS $\mathcal{F}$ on $\mathbb{R}$ with $c_{\mathcal{F}} = c$ and a strategy for Reality
satisfying the following property. For any $N = 1, 2, \ldots$, any positive constant
$d \le (Y / c_{\mathcal{F}}) \sqrt{N}$, and any on-line prediction algorithm, there exists a prediction
rule $D \in \mathcal{F}$ such that $\|D\|_{\mathcal{F}} = d$ and

$$\sum_{n=1}^N (y_n - \mu_n)^2 \ge \sum_{n=1}^N (y_n - D(x_n))^2 + 2Y c_{\mathcal{F}} \|D\|_{\mathcal{F}} \sqrt{N} - c_{\mathcal{F}}^2 \|D\|_{\mathcal{F}}^2, \qquad (6)$$

where, as usual, $\mu_n$ are the predictions produced by the on-line prediction
algorithm and $(x_n, y_n)$ are Reality's moves.

Theorems and 0] are proved in ^JHl and respectively. From the proof of 
Theorem 4 it will be clear that similar lower bounds also hold when $\mathbf{X} = \mathbb{R}$ is
replaced by any regular (e.g., open) subset of a Euclidean space.

Remark If Cjr = oo but it is known in advance that all objects Xn, n = 1,2,.. ., 
will be chosen from a set A C X satisfying X := sup^g^ cjr(x) < oo, Theorem 
nj0]will continue to hold when Cjf is replaced by X. 

Universal consistency 

We say that an RKHS $\mathcal{F}$ on Z is universal if Z is a topological space and for
every compact subset A of Z every continuous function on A can be arbitrarily
well approximated in the metric C(A) by functions in $\mathcal{F}$; in the case of compact
Z this coincides with the definition given in IH' (Definition 4). All examples of
RKHS given in §4 are universal.

Suppose the object space $\mathbf{X}$ is a topological space; as in the rest of the paper,
we are assuming that $|y_n|$ are bounded by a known constant Y. Let us say that
an on-line prediction algorithm is universally consistent if its predictions $\mu_n$
always satisfy

$$\left(x_n \in A, \forall n \in \{1, 2, \ldots\}\right) \Longrightarrow \limsup_{N \to \infty} \left( \frac{1}{N} \sum_{n=1}^N (y_n - \mu_n)^2 - \frac{1}{N} \sum_{n=1}^N (y_n - D(x_n))^2 \right) \le 0 \qquad (7)$$

for any compact subset A of $\mathbf{X}$ and any continuous decision rule D (cf. (1)).
By the Tietze-Uryson theorem (^Hj, Theorem 2.6.4 on p. 65), if $\mathbf{X}$ is a normal
topological space, we will obtain an equivalent definition allowing D to be any
continuous function from A to $\mathbb{R}$.

The definitions of this subsection are most intuitive in the case of compact
$\mathbf{X}$, and in our informal discussion we will be making this assumption. The
main remaining difference of our definition of universal consistency from the
statistical one 0^1 is that we require D to be continuous. If D is allowed to be
discontinuous, (7) is impossible to achieve: no matter how Predictor chooses his
predictions, Reality can choose

$$x_n := \sum_{i=1}^{n-1} \frac{\operatorname{sign}(\mu_i)}{3^i}, \qquad y_n := \begin{cases} 1 & \text{if } \mu_n < 0 \\ -1 & \text{otherwise} \end{cases}$$

(assuming $\mathbf{X} \supseteq [-1, 1]$ and $Y \ge 1$), foiling (7) for the prediction rule

$$D(x) := \begin{cases} -1 & \text{if } x < \sum_{i=1}^{\infty} \operatorname{sign}(\mu_i) / 3^i \\ 1 & \text{otherwise.} \end{cases}$$
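
Reality's strategy against discontinuous rules can be simulated directly. The sketch below is a finite-horizon illustration under two assumptions that are not spelled out in the text: sign(0) is taken to be +1 (matching the convention that $y_n = -1$ when $\mu_n = 0$), and the infinite sum in the threshold of D is truncated after N terms, which does not change D on $x_1, \ldots, x_N$. The names `foil` and `last_label_predictor` are hypothetical.

```python
from fractions import Fraction


def sign(mu):
    """Assumed convention: sign(0) = +1."""
    return 1 if mu >= 0 else -1


def foil(predictor, N):
    """Play Reality's strategy for N rounds against predictor(xs, ys, x).

    Returns (Predictor's cumulative loss, cumulative loss of the step rule D).
    """
    xs, ys, mus = [], [], []
    x = Fraction(0)                      # x_1 is the empty sum
    for n in range(1, N + 1):
        mu = predictor(xs, ys, x)
        y = 1 if mu < 0 else -1          # label on the opposite side of the prediction
        xs.append(x)
        ys.append(y)
        mus.append(mu)
        x += Fraction(sign(mu), 3 ** n)  # next object x_{n+1}
    threshold = x                        # truncated version of the infinite sum

    def D(t):
        return -1 if t < threshold else 1

    predictor_loss = sum((y - mu) ** 2 for y, mu in zip(ys, mus))
    rule_loss = sum((y - D(t)) ** 2 for y, t in zip(ys, xs))
    return predictor_loss, rule_loss


def last_label_predictor(xs, ys, x):
    """Hypothetical Predictor: repeat the previous label."""
    return float(ys[-1]) if ys else 0.0


predictor_loss, rule_loss = foil(last_label_predictor, N=10)
print(predictor_loss, rule_loss)
```

Whatever Predictor does, each round costs him at least 1, while the step function D fits every example exactly, so the average regret stays bounded away from zero.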



A positive argument in favor of the requirement of continuity of D is that
it is natural for Predictor to compete only with computable prediction
strategies, and continuity is often regarded as a necessary condition for computability
(Brouwer's "continuity principle").

The existence of universal RKHS on Euclidean spaces $\mathbb{R}^m$ (see §4) implies
the following proposition.

Corollary 1 If $\mathbf{X} \subseteq \mathbb{R}^m$ for some $m = 1, 2, \ldots$, there exists a universally
consistent on-line prediction algorithm.

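Proof Any on-line prediction algorithm satisfying (3) of Theorem 1 for a universal
RKHS $\mathcal{F}$ on $\mathbb{R}^m$ will be universally consistent. Indeed, let $A \subseteq \mathbf{X}$ be compact, f be a
continuous function on $\mathbf{X}$, and $\epsilon > 0$. Suppose $x_n \in A$, $n = 1, 2, \ldots$. Our goal
is to prove that

$$\frac{1}{N} \sum_{n=1}^N (y_n - \mu_n)^2 \le \frac{1}{N} \sum_{n=1}^N (y_n - f(x_n))^2 + \epsilon$$

from some N on. It suffices to choose $D \in \mathcal{F}$ at a distance at most $\epsilon / (8Y)$ from
f in the metric C(A), apply (3) to D, and notice that

$$\left| \frac{1}{N} \sum_{n=1}^N (y_n - D(x_n))^2 - \frac{1}{N} \sum_{n=1}^N (y_n - f(x_n))^2 \right| \le \frac{1}{N} \sum_{n=1}^N |D(x_n) - f(x_n)| \, |2 y_n - D(x_n) - f(x_n)| \le \frac{\epsilon}{8Y} \cdot 4Y = \frac{\epsilon}{2}$$

(this calculation assumes that f and D take values in $[-Y, Y]$; we can always
achieve this by truncating f and D: truncation does not lead outside the
universal RKHS described in §4). ■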



Remark It is easy to extend Corollary 1 to the case where $\mathbf{X}$ is a separable
metric space or a compact metric space: indeed, by Theorem 4.2.10 in [201
the Hilbert cube is a universal space for all separable metric spaces and for
all compact metric spaces, and every continuous function on the Hilbert cube
(we are interested in continuous extensions of continuous functions on compact
subsets), being uniformly continuous (see, e.g., ^^Ij Corollary 2.4.6 on p. 52),
can be arbitrarily well approximated by functions that only depend on the first
m coordinates of their argument; it remains to notice that the on-line prediction
algorithms satisfying the condition of Theorem 1 for universal RKHS on $[0, 1]^m$
can be merged into one on-line prediction algorithm using, e.g., the Aggregating
Algorithm.

So far in this subsection we have only discussed the asymptotic notion of uni- 
versal consistency, although it is clear that one needs universality in a stronger 
sense. In practical problems, it is not enough for the benchmark class $\mathcal{F}$ to be
universal; we also want as many prediction rules D as possible to belong to 
or at least to be well approximated by the elements of T] we also want 
to be as small as possible. The Sobolev spaces on $[0, 1]^m$ discussed in §4 are not
only universal RKHS but also include all functions that are smooth in a fairly 
weak sense. However, the Hilbert-space methods have their limitations: it is 
not clear, e.g., how to apply them to functions that are as "smooth" as typical 
trajectories of the Brownian motion. These larger benchmark classes seem to 
require Banach-space methods: see [57].

3 Implications for the statistical theory of re- 
gression 

So far we have not made any stochastic assumptions about the way the examples
are produced. In this section we derive simple implications from Theorem
1 for the statistical learning framework, assuming that the examples $(x_n, y_n)$
are drawn independently from some probability distribution on $\mathbf{X} \times [-Y, Y]$.
Similar implications can be derived from the results of [15], [H], and some other
papers (see the next section); the corollary stated in this section, however, has 
somewhat better constants. 

Generalization bounds 

The risk of a prediction rule $D : \mathbf{X} \to \mathbb{R}$ with respect to a probability
distribution P on $\mathbf{X} \times [-Y, Y]$ is defined as

$$\mathrm{risk}_P(D) := \int_{\mathbf{X} \times [-Y, Y]} (y - D(x))^2 \, P(dx, dy).$$

Our goal in this section is to construct, from a given sample, a prediction rule
whose risk is competitive with the risk of small-norm prediction rules in a given
RKHS. As shown in (with similar results obtained earlier in and before
that in [38]), this can be easily done once we have a competitive on-line algorithm
(such as those in Theorems 1-3).

Fix an on-line prediction algorithm and a sequence of examples

$$(x_1, y_1), (x_2, y_2), \ldots.$$

For each $n = 1, 2, \ldots$, let $H_n : \mathbf{X} \to \mathbb{R}$ be the function that maps each
$x \in \mathbf{X}$ to the prediction $\mu_n \in \mathbb{R}$ output by the algorithm when fed with
$(x_1, y_1), \ldots, (x_{n-1}, y_{n-1}), x$. We will say that the prediction rule

$$\bar{H}_N(x) := \frac{1}{N} \sum_{n=1}^N H_n(x)$$

is obtained by averaging from the on-line prediction algorithm.
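
The averaging construction is easy to express in code. The sketch below is a minimal illustration: `predict(history, x)` stands for an arbitrary on-line prediction algorithm, and the running-mean strategy used in the example is a hypothetical placeholder rather than one of the algorithms of this paper.

```python
def averaged_rule(predict, examples):
    """Return the averaged rule x -> (1/N) * sum_{n=1}^N H_n(x).

    H_n(x) is what the on-line algorithm would predict for x after
    having seen only the first n-1 examples.
    """
    N = len(examples)

    def H_bar(x):
        return sum(predict(examples[: n - 1], x) for n in range(1, N + 1)) / N

    return H_bar


def running_mean_predictor(history, x):
    """Hypothetical on-line algorithm: mean of the labels seen so far."""
    if not history:
        return 0.0
    return sum(y for _, y in history) / len(history)


examples = [(0.1, 0.5), (0.4, -0.5), (0.9, 1.0)]
H_bar = averaged_rule(running_mean_predictor, examples)
print(H_bar(0.3))  # average of H_1(0.3) = 0, H_2(0.3) = 0.5, H_3(0.3) = 0
```

Note that the averaged rule uses the N snapshots $H_1, \ldots, H_N$, not just the final state of the algorithm; this is what makes the convexity argument in the proof below work.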

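Corollary 2 Let $\mathcal{F}$ be an RKHS on $\mathbf{X}$, let $D \in \mathcal{F}$ be such that $D(x) \in [-Y, Y]$
for all $x \in \mathbf{X}$, and let $\bar{H}_N$, $N = 1, 2, \ldots$, be the prediction rules obtained by
averaging from some on-line prediction algorithm guaranteeing (3). For any
probability distribution P on $\mathbf{X} \times [-Y, Y]$, any $N = 1, 2, \ldots$, and any $\delta > 0$,

$$\mathrm{risk}_P(\bar{H}_N) \le \mathrm{risk}_P(D) + \frac{2Y}{\sqrt{N}} \left( \sqrt{c_{\mathcal{F}}^2 + 1}\, \left(\|D\|_{\mathcal{F}} + Y\right) + 2Y \sqrt{2 \ln \frac{2}{\delta}} \right) \qquad (8)$$

with probability at least $1 - \delta$.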

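Proof For a suitable choice of $\epsilon > 0$, we will have

$$\mathrm{risk}_P(\bar{H}_N) \le \frac{1}{N} \sum_{n=1}^N \mathrm{risk}_P(H_n) \qquad (9)$$

$$\le \frac{1}{N} \sum_{n=1}^N (y_n - H_n(x_n))^2 + \epsilon \qquad (10)$$

$$\le \frac{1}{N} \sum_{n=1}^N (y_n - D(x_n))^2 + \frac{2Y}{\sqrt{N}} \sqrt{c_{\mathcal{F}}^2 + 1}\, \left(\|D\|_{\mathcal{F}} + Y\right) + \epsilon \qquad (11)$$

$$\le \frac{1}{N} \sum_{n=1}^N \mathrm{risk}_P(D) + \frac{2Y}{\sqrt{N}} \sqrt{c_{\mathcal{F}}^2 + 1}\, \left(\|D\|_{\mathcal{F}} + Y\right) + 2\epsilon \qquad (12)$$

$$= \mathrm{risk}_P(D) + \frac{2Y}{\sqrt{N}} \sqrt{c_{\mathcal{F}}^2 + 1}\, \left(\|D\|_{\mathcal{F}} + Y\right) + 2\epsilon$$

with probability at least $1 - \delta$. The inequalities (9) and (11) always hold: the
first follows from the convexity of the function $t \mapsto t^2$, and the second from
Theorem 1. By Hoeffding's martingale inequality (01], Theorem 1 and the
remark at the end of §2; see also Theorem 9.1 on p. 135), (10) and (12) will each
hold with probability at least $1 - e^{-\epsilon^2 N / (8Y^4)}$. To make the probability of their
conjunction at least $1 - \delta$, it suffices to find $\epsilon$ from the equation $e^{-\epsilon^2 N / (8Y^4)} = \delta/2$,
which gives

$$\epsilon = 2Y^2 \sqrt{\frac{2}{N} \ln \frac{2}{\delta}}.$$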




In Corollary 2 we only consider prediction rules taking values in $[-Y, Y]$;
this is not a real restriction if the RKHS $\mathcal{F}$ satisfies $D \in \mathcal{F} \Longrightarrow |D| \in \mathcal{F}$, as the
examples of RKHS considered in §4 do.
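
The confidence term in Corollary 2 can be evaluated numerically. The sketch below assumes that the equation solved in the proof is $e^{-\epsilon^2 N/(8Y^4)} = \delta/2$ and that the regret term of Theorem 1 is $2Y\sqrt{c_{\mathcal{F}}^2+1}\,(\|D\|_{\mathcal{F}}+Y)\sqrt{N}$; the function names are illustrative.

```python
import math


def hoeffding_epsilon(Y, N, delta):
    """Solve exp(-eps^2 * N / (8 * Y^4)) = delta / 2 for eps."""
    return 2 * Y ** 2 * math.sqrt(2 * math.log(2 / delta) / N)


def corollary2_excess_risk(c_F, norm_D, Y, N, delta):
    """Upper bound on risk(H_bar_N) - risk(D): averaged regret plus 2*eps."""
    regret_per_round = 2 * Y * math.sqrt(c_F ** 2 + 1) * (norm_D + Y) / math.sqrt(N)
    return regret_per_round + 2 * hoeffding_epsilon(Y, N, delta)


eps = hoeffding_epsilon(Y=1.0, N=1000, delta=0.05)
print(eps)                                   # deviation allowed for each Hoeffding step
print(math.exp(-eps ** 2 * 1000 / 8))        # recovers delta / 2 = 0.025
print(corollary2_excess_risk(1.15, 5.0, 1.0, 1000, 0.05))
```

Both parts of the bound decay at the rate $1/\sqrt{N}$, so quadrupling the sample size halves the guaranteed excess risk.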

Universally consistent procedures 

Suppose the object space $\mathbf{X}$ is the Euclidean space $\mathbb{R}^m$ for some m. It is easy to
see that Corollary 2 implies the existence of universally consistent procedures in
the sense of Stone 0^1 for a known upper bound Y on $|y_n|$. Indeed, by Luzin's
theorem (^Hl, Theorem 7.5.2 on p. 244; see also Theorem 7.1.3 on p. 225) for
any Borel measurable prediction rule $f : \mathbf{X} \to [-Y, Y]$ and any $\epsilon > 0$ there exists
a closed set $F \subseteq \mathbf{X}$ of probability at least $1 - \epsilon$ such that the restriction of f to
F is continuous; it is obvious that we can also assume that F is compact. Let
D be a function in a universal RKHS on $\mathbf{X}$ (the existence of the latter is shown
in §4) taking values in $[-Y, Y]$ and close to f in the metric C(F). It remains to
apply Corollary 2.

Intuitively, the statistical assumption that the examples are produced inde- 
pendently from the same distribution is strong enough for the requirement of 
continuity to be superfluous: as Cover mentioned in his discussion of Stone's 
paper, it holds automatically with high probability. 

4 Examples of RKHS and reproducing kernels 

The usefulness of the results stated in the previous two sections depends on the 
availability of suitable RKHS. In this section I will only give the simplest examples;
for numerous other examples see, e.g., 1^21 j jS^i and [46].

The Sobolev spaces 

The Sobolev norm $\|f\|_{H^1}$ of an absolutely continuous function $f : [0, 1] \to \mathbb{R}$ is
defined by

$$\|f\|_{H^1}^2 := \int_0^1 (f(t))^2 \, dt + \int_0^1 (f'(t))^2 \, dt. \qquad (13)$$

The Sobolev space $H^1([0, 1])$ on [0, 1] is the set of absolutely continuous
$f : [0, 1] \to \mathbb{R}$ satisfying $\|f\|_{H^1} < \infty$ equipped with the norm $\|\cdot\|_{H^1}$. It is easy to
see that $H^1([0, 1])$ is an RKHS.

In fact, $H^1([0, 1])$ is only one of a range of Sobolev spaces; see, e.g., for the
definition of the full range (denoted $W^{s,p}(\Omega)$ there; we are interested in the case
s = 1, p = 2, and $\Omega = (0, 1)$, with the elements of $W^{1,2}((0, 1))$ extended to [0, 1]
by continuity). The space $H^1([0, 1])$ is the "least smooth" among the Sobolev
spaces $H^s([0, 1])$ if we ignore the slightly less natural case of a fractional s. All
of $H^s([0, 1])$ are universal RKHS, but $H^1([0, 1])$ is a proper superset of all other
$H^s([0, 1])$, and so is the "most universal" Sobolev space of this type.

It is easy to see that neither of the two addends in (13) can be omitted: if
the first addend is omitted, the square root of the right-hand side of (13) ceases
to be a norm (since it becomes zero for every constant), and if the second
addend is omitted, the function space ceases to be an RKHS (since the evaluation
functionals become unbounded). We can, however, "partially omit" the first
addend, replacing (13) with the Fermi-Sobolev norm $\|f\|_{FS}$ defined by

$$\|f\|_{FS}^2 := \left( \int_0^1 f(t) \, dt \right)^2 + \int_0^1 (f'(t))^2 \, dt \qquad (14)$$

for absolutely continuous functions $f : [0, 1] \to \mathbb{R}$. The Fermi-Sobolev space on
[0, 1] is the set of absolutely continuous $f : [0, 1] \to \mathbb{R}$ satisfying $\|f\|_{FS} < \infty$
equipped with the norm $\|\cdot\|_{FS}$. It is clear that it is still an RKHS, and it is still
universal.

Of course, the underlying set Z of an RKHS does not have to be a compact
topological space: we can define the Sobolev norm $\|f\|_{H^1}$ of an absolutely
continuous function $f : \mathbb{R} \to \mathbb{R}$ by essentially the same formula

$$\|f\|_{H^1}^2 := \int_{-\infty}^{\infty} (f(t))^2 \, dt + \int_{-\infty}^{\infty} (f'(t))^2 \, dt \qquad (15)$$

and define the Sobolev space $H^1(\mathbb{R})$ on $\mathbb{R}$ as the set of absolutely continuous
$f : \mathbb{R} \to \mathbb{R}$ satisfying $\|f\|_{H^1} < \infty$.

To apply Theorems 1-3 to these RKHS we need to know the value of $c_{\mathcal{F}}$ for
them; later in this section we will see that

$$c_{\mathcal{F}} = c_{H^1([0,1])} = \sqrt{\coth 1} \approx 1.15$$

for the Sobolev space $H^1([0, 1])$,

$$c_{\mathcal{F}} = c_{FS} = 2 / \sqrt{3} \approx 1.15$$

for the Fermi-Sobolev space on [0, 1], and

$$c_{\mathcal{F}} = c_{H^1(\mathbb{R})} = 1 / \sqrt{2} \approx 0.71$$

for the Sobolev space $H^1(\mathbb{R})$.

Remark The term "Sobolev space" usually serves as the name for a
topological vector space; all these spaces are normable, but different norms are
not considered to lead to different Sobolev spaces as long as the topology
does not change. The norms given by (13) and (15) are the most standard
ones. It is easy to see that the norm (14) leads to the same topology as (13):
$\|f\|_{FS} \le \|f\|_{H^1}$ follows from the standard inequality between the $L^1$ and $L^2$
norms, and $\|f\|_{H^1} = O(\|f\|_{FS})$ follows from Wirtinger's inequality (which
implies that $\int_0^\pi f^2 \le \int_0^\pi (f')^2$ for every function f on $[0, \pi]$ such that $f'$ is in $L^2$
and $\int_0^\pi f = 0$; for the statement and a proof of Wirtinger's inequality, see [29],
Theorem 258).

We are often interested in the case where the objects $x_n$ are vectors in a
Euclidean space $\mathbb{R}^m$; if their components are bounded, we can scale them so
that $x_n \in [0, 1]^m$. In any case, we can take the mth tensor power $\mathcal{F}$ of one of
the three RKHS we have just defined as our benchmark class. (For the definition
and properties of tensor products of RKHS see, e.g., 0j, §1.8.) We will see later
that $c_{\mathcal{F}}$ for the mth tensor power is the mth power of the $c_{\mathcal{F}}$ for the original
RKHS. The mth tensor powers of the Sobolev and Fermi-Sobolev spaces on [0, 1]
are universal on $[0, 1]^m$ and the mth tensor power of $H^1(\mathbb{R})$ is universal on $\mathbb{R}^m$
(this can be seen from the construction given in 0j, §1.8).

Theorem 3 requires separable RKHS; the separability of Sobolev spaces
for integer s is proved in, e.g., Theorem 3.6 (and it also remains true for
fractional s).

Reproducing kernels 

An equivalent language for talking about RKHS is provided by the notion of a 
reproducing kernel; this subsection defines reproducing kernels and summarizes 
some of their properties. For a detailed discussion, see, e.g., or |41| . 

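Let $\mathcal{F}$ be an RKHS on Z. By the Riesz-Fischer theorem, for each $z \in Z$
there exists a function $k_z \in \mathcal{F}$ such that

$$f(z) = \langle k_z, f \rangle_{\mathcal{F}}, \quad \forall f \in \mathcal{F}. \qquad (16)$$

The next lemma asserts that $\|k_z\|_{\mathcal{F}}$ is the norm $c_{\mathcal{F}}(z)$ of the evaluation
functional $f \mapsto f(z)$.

Lemma 1 Let $\mathcal{F}$ be an RKHS on Z. For each $z \in Z$,

$$\|k_z\|_{\mathcal{F}} = c_{\mathcal{F}}(z).$$

Proof Fix $z \in Z$. We are required to prove

$$\sup_{f : \|f\|_{\mathcal{F}} \le 1} |f(z)| = \|k_z\|_{\mathcal{F}}.$$

The inequality $\le$ follows from

$$|f(z)| = |\langle k_z, f \rangle_{\mathcal{F}}| \le \|k_z\|_{\mathcal{F}} \|f\|_{\mathcal{F}} \le \|k_z\|_{\mathcal{F}},$$

where $\|f\|_{\mathcal{F}} \le 1$. The inequality $\ge$ follows from

$$|f(z)| = \frac{\langle k_z, k_z \rangle_{\mathcal{F}}}{\|k_z\|_{\mathcal{F}}} = \|k_z\|_{\mathcal{F}},$$

where $f := k_z / \|k_z\|_{\mathcal{F}}$ and $\|k_z\|_{\mathcal{F}}$ is assumed to be non-zero. ■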
The reproducing kernel of $\mathcal{F}$ is the function $k : Z^2 \to \mathbb{R}$ defined by

$$k(z, z') := \langle k_z, k_{z'} \rangle_{\mathcal{F}}$$

(equivalently, we could define $k(z, z')$ as $k_z(z')$ or as $k_{z'}(z)$). The origin of this
name is the "reproducing property" (16).

There is a simple internal characterization of reproducing kernels of RKHS.
First, it is easy to check that the function $k(z, z')$, as we defined it, is symmetric,

$$k(z, z') = k(z', z), \quad \forall (z, z') \in Z^2,$$

and positive definite,

$$\sum_{i=1}^m \sum_{j=1}^m a_i a_j k(z_i, z_j) \ge 0, \quad \forall m = 1, 2, \ldots, \ (a_1, \ldots, a_m) \in \mathbb{R}^m, \ (z_1, \ldots, z_m) \in Z^m.$$

On the other hand, for every symmetric and positive definite $k : Z^2 \to \mathbb{R}$ there
exists a unique RKHS $\mathcal{F}$ such that k is the reproducing kernel of $\mathcal{F}$ (Theorem
2 on p. 143).
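
The two defining properties can be spot-checked numerically for a candidate kernel. The sketch below is a randomized sanity check, not a proof: it evaluates the quadratic form for random coefficient vectors on a fixed set of points. It uses the kernel $\frac{1}{2}\exp(-|t - t'|)$ of $H^1(\mathbb{R})$ quoted later in this section; all function names are illustrative.

```python
import math
import random


def laplace_kernel(t, s):
    """Kernel of H^1(R) as quoted later in this section: exp(-|t - s|) / 2."""
    return 0.5 * math.exp(-abs(t - s))


def quadratic_form(kernel, points, coeffs):
    """sum_i sum_j a_i a_j k(z_i, z_j)."""
    return sum(a * b * kernel(z, w)
               for a, z in zip(coeffs, points)
               for b, w in zip(coeffs, points))


def looks_like_kernel(kernel, points, trials=200, seed=0, tol=1e-9):
    """Check symmetry on all pairs and non-negativity on random coefficients."""
    rng = random.Random(seed)
    symmetric = all(abs(kernel(z, w) - kernel(w, z)) <= tol
                    for z in points for w in points)
    if not symmetric:
        return False
    for _ in range(trials):
        coeffs = [rng.uniform(-1, 1) for _ in points]
        if quadratic_form(kernel, points, coeffs) < -tol:
            return False
    return True


points = [-1.0, -0.3, 0.0, 0.4, 2.5]
print(looks_like_kernel(laplace_kernel, points))            # expected to pass
print(looks_like_kernel(lambda t, s: -abs(t - s), [0.0, 1.0]))  # symmetric, not positive definite
```

A failed check is conclusive (the function is not a kernel), whereas a passed check only means that no counterexample was found among the sampled coefficients.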

We can see that the notions of a reproducing kernel of RKHS and of a
symmetric positive definite function on $Z^2$ have the same content, and we will
sometimes say "kernel on Z" to mean a symmetric positive definite function on
$Z^2$. Kernels in this sense are the main source of RKHS in learning theory: cf.
[5211441 BH]. Every kernel on $\mathbf{X}$ is a valid parameter for our prediction algorithms;
to apply Theorems 1-3 we can use the equivalent definition of $c_{\mathcal{F}}$,

$$c_{\mathcal{F}} = c_k := \sup_{x \in \mathbf{X}} \sqrt{k(x, x)}, \qquad (17)$$

k being the reproducing kernel of $\mathcal{F}$.

It was convenient to start from RKHS in stating the theorems of §2, but our
prediction algorithms, two of which are explicitly described in §10, use the more
constructive representation of RKHS via their reproducing kernels.

Norm vs. the reproducing kernel in RKHS 

Finding the norm given the reproducing kernel and vice versa are often nontriv- 
ial problems for specific RKHS. The most popular methods appear to be the 
following. 

• As we saw in the proof of Lemma 1, $k_z / \|k_z\|_{\mathcal{F}}$ is the function at which

$$\sup_{f : \|f\|_{\mathcal{F}} \le 1} |f(z)|$$

is attained (assuming that $\|k_z\|_{\mathcal{F}} \ne 0$ and that this optimization problem
has a unique solution). Solving this optimization problem we can find the
kernel k given the norm $f \mapsto \|f\|_{\mathcal{F}}$. For application of this method to the
Fermi-Sobolev space on [0, 1], see Appendix C.
• One can use expansions into Fourier series of functions in a given RKHS.
For examples see, e.g., [57], §4.2.1, or, for the Fermi-Sobolev space on
[0, 1], Eni (version 2).

• If Z is a Euclidean space and the reproducing kernel $k(z, z')$ only depends
on the difference $z - z'$ (is "translation-invariant"), an explicit formula
for the reproducing kernel can sometimes be obtained by applying the
Fourier transform to both sides of (16) (similar methods are applied to
the Sobolev space $H^1(\mathbb{R})$ in [51] and gZ]).

The reproducing kernel of the Sobolev space $H^1([0, 1])$, as given in jJUj (§7.4,
Example 13; Exercise 3.12.7) with a reference to is

$$k(t, t') = \frac{\cosh \min(t, t') \, \cosh \min(1 - t, 1 - t')}{\sinh 1}.$$

This implies Marti's gD] result that

$$c_k^2 = \sup_{t \in [0, 1]} \frac{\cosh t \, \cosh(1 - t)}{\sinh 1} = \frac{\cosh 0 \, \cosh 1}{\sinh 1} = \coth 1,$$

as stated above.

The reproducing kernel of the Fermi-Sobolev space on [0, 1] was found in
HSj (see also [HOI, §10.2, or EU, §2.3.3); it is given by

$$k(t, t') = k_0(t) k_0(t') + k_1(t) k_1(t') + k_2(|t - t'|) = \frac{1}{2} \min{}^2(t, t') + \frac{1}{2} \min{}^2(1 - t, 1 - t') + \frac{5}{6}, \qquad (18)$$

where $k_l := B_l / l!$ are scaled Bernoulli polynomials $B_l$. So, for the Fermi-Sobolev
space on [0, 1] we have

$$c_k^2 = \max_{t \in [0, 1]} \left( \frac{1}{2} t^2 + \frac{1}{2} (1 - t)^2 + \frac{5}{6} \right) = \frac{4}{3}.$$

The reproducing kernel of the Sobolev space $H^1(\mathbb{R})$ is

$$k(t, t') = \frac{1}{2} \exp(-|t - t'|)$$

(see lU, g7], or ^Hj, §7.4, Example 24). From the last equation we can see
that $c_{H^1(\mathbb{R})} = 1 / \sqrt{2}$.
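
The three values of $c_{\mathcal{F}}$ quoted in this section can be checked numerically from definition (17). The sketch below assumes the kernels read as follows: $\cosh\min(t,t')\cosh\min(1-t,1-t')/\sinh 1$ for $H^1([0,1])$, $\frac{1}{2}\min^2(t,t') + \frac{1}{2}\min^2(1-t,1-t') + \frac{5}{6}$ for the Fermi-Sobolev space, and $\frac{1}{2}\exp(-|t-t'|)$ for $H^1(\mathbb{R})$. The supremum is approximated by a maximum over a grid; function names are illustrative.

```python
import math


def k_sobolev_unit(t, s):
    """Kernel of H^1([0, 1])."""
    return math.cosh(min(t, s)) * math.cosh(min(1 - t, 1 - s)) / math.sinh(1)


def k_fermi_sobolev(t, s):
    """Kernel of the Fermi-Sobolev space on [0, 1]."""
    return 0.5 * min(t, s) ** 2 + 0.5 * min(1 - t, 1 - s) ** 2 + 5 / 6


def k_sobolev_line(t, s):
    """Kernel of H^1(R); translation-invariant, so any grid gives the same value."""
    return 0.5 * math.exp(-abs(t - s))


def c_from_kernel(kernel, grid):
    """Grid approximation of c_k = sup_x sqrt(k(x, x)) from (17)."""
    return max(math.sqrt(kernel(x, x)) for x in grid)


grid = [i / 1000 for i in range(1001)]   # includes the endpoints 0 and 1
c_unit = c_from_kernel(k_sobolev_unit, grid)
c_fs = c_from_kernel(k_fermi_sobolev, grid)
c_line = c_from_kernel(k_sobolev_line, grid)
print(round(c_unit, 2), round(c_fs, 2), round(c_line, 2))

# For the m-th tensor power the constant is the m-th power of the original one.
m = 3
print(c_unit ** m, c_fs ** m, c_line ** m)
```

For both spaces on [0, 1] the maximum of k(t, t) is attained at the endpoints, which is why the grid is chosen to contain them.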

It is a general fact that the reproducing kernel of the m-fold product of
RKHS can be obtained as the m-fold product of the reproducing kernels of the
components (0, §1.8, Theorem I). For example, the reproducing kernel of the
mth power of $H^1([0, 1])$ is

$$k((t_1, \ldots, t_m), (t'_1, \ldots, t'_m)) = \prod_{i=1}^m \frac{\cosh \min(t_i, t'_i) \, \cosh \min(1 - t_i, 1 - t'_i)}{\sinh 1}.$$

We can see that

$$c_{\mathcal{F}} = (\coth 1)^{m/2}, \quad c_{\mathcal{F}} = (2 / \sqrt{3})^m, \quad c_{\mathcal{F}} = 2^{-m/2}$$

for the mth power of the Sobolev space $H^1([0, 1])$, of the Fermi-Sobolev space
on [0, 1], and of the Sobolev space $H^1(\mathbb{R})$, respectively.

An extensive list of RKHS together with their reproducing kernels is given
in [ini, §7.4.



5 Some comparisons 

The first paper about competitive on-line regression is [24]; for a brief review
of the work done in the 1990s, see [54], §4. Our results are especially close to
those of Cesa-Bianchi et al. and Auer et al.

There are two main proof techniques in the existing theory of competitive
on-line regression: various generalizations of gradient descent (used in, e.g., [TS] .

and and the Bayes-type Aggregating Algorithm (proposed in and
described in detail in PO); for a streamlined presentation, see [SJ). In this
subsection we will only discuss the former; some information about the latter
will be given in ^

Comparison between our results and the known ones is somewhat complicated
by the fact that most of the existing literature only deals with the
Euclidean spaces $\mathbb{R}^m$. Typically, when loss bounds do not depend on m, they
can be carried over to Hilbert spaces (perhaps satisfying some extra regularity
assumptions, such as separability), and so to some RKHS. To understand what
such known results say in the case of RKHS, the upper bound on the size $\|x_n\|$
of the objects (if present) has to be replaced by $c_{\mathcal{F}}$ (cf. the remark on p.EJl, and
the upper bound on the size of the weight vector has to be interpreted as
an upper bound on $\|D\|_{\mathcal{F}}$.

With such replacements, Theorem IV.4 on p. 610 of Cesa-Bianchi et al. [15]
becomes

N N 
J2iyn - y^nf < Y.^yn " ^(^^n))' 



9.2 Y, 



N 



. inf V(y„-i5(x„))2-t-y2 I 

\D:\\D\\^<Y/X^^' J 



where $\mu_n$ are their algorithm's predictions. This result is of the same type as 
(0}, but $\|D\|_{\mathcal F}$ is bounded by $Y/X$; because of such a bound (present in all other 
results reviewed here) the corresponding prediction algorithm is not guaranteed 
to be universally consistent. 

Auer et al. make the upper bound on $\|D\|_{\mathcal F}$ more general: their Theorem 
3.1 (p. 66) implies that, for their algorithm, 
N N 



n— 1 n— 1 



\ n=l 



where U is a known upper bound on $\|D\|_{\mathcal F}$ and Y is assumed to be 1. This is 
remarkably similar to I0J and (O. 

This type of results was extended by Zinkevich Theorem 1) to a general 
class of convex loss functions. 

The main differences of these results from our Theorems ^I^l are that their 
leading constants are somewhat worse and that they assume a known upper 
bound on $\|D\|_{\mathcal F}$. The last circumstance might appear especially serious, since it 
prevents universal consistency even when the Hilbert space used is a universal 
RKHS. However, there is a simple way to achieve universal consistency: the 
Aggregating Algorithm, or a similar procedure, may be used on top of the existing 
algorithm (the unknown upper bound may be considered to be an "expert", and 
the predictions made by all "experts", say of the form $2^k$, k = 1, 2, ..., can be 
merged into one prediction on each round). This was noticed by Auer et al. [H], 
although they did not develop this idea further. 

The remaining minor component in achieving universal consistency is using 
a universal function class as the benchmark class. It is interesting that 
Cesa-Bianchi et al. used an "almost universal" function class in their pioneering paper 
[TK] (§V; their class was not quite universal because of the requirement f(0) = 0). 
A very interesting early paper about on-line regression competitive with function 
spaces (although not universal) is [3^ (continued by [23); it, however, assumes 
that the benchmark class contains a perfect prediction rule, and its results are 
very different from ours. 

A major advantage of the methods based on gradient descent is their sim- 
plicity and computational efficiency. The technique of defensive forecasting, 
which we emphasize in this paper, appears closer to gradient descent than to 
the Bayes-type algorithms. There has been a mutually beneficial exchange of 
ideas between the gradient descent and Bayes-type approaches, and combining 
gradient descent and defensive forecasting might turn out even more productive. 

Results such as Corollary El can be obtained by a routine application of 
well-known results in competitive on-line learning, but they might not be easy 
to obtain by the traditional methods of statistical learning theory. The closest 
results of this kind in statistical learning theory that I am aware of are Theorem 
C* (applied to Sobolev spaces and smooth kernels in Examples 3 and 4) of [T7j 
and Corollary 6.7 of [H|- These results, however, use balls in RKHS as benchmark 
classes, and therefore, do not guarantee even universal consistency. 

Corollary [3 can be strengthened by using the results of instead of those 
of [ig. 

6 Proof of Theorem 1 

This section is essentially a simplified (and to some degree cut-and-pasted) 
version of §§5-7 of [SB]. First we modify the prediction protocol, introducing a third 
player, Skeptic, who is allowed to bet at the odds implied by Predictor's moves. 

Forecasting Game I 

Players: Reality, Predictor, Skeptic 

Protocol: 

FOR n = 1, 2, ...: 
  Reality announces $x_n \in X$. 
  Predictor announces $\mu_n \in \mathbb{R}$. 
  Skeptic announces $s_n \in \mathbb{R}$. 
  Reality announces $y_n \in [-Y, Y]$. 
  $\mathcal{K}_n := \mathcal{K}_{n-1} + s_n (y_n - \mu_n)$. 
END FOR. 

In this protocol, the prediction $\mu_n$ is interpreted as the price Predictor charges 
for a ticket paying $y_n$; $s_n$ is the number of tickets Skeptic decides to buy. (We 
sometimes refer to predictions interpreted this way as forecasts, although the 
difference between forecasts and the decision-type predictions of is not as 
important here as for the more general loss functions considered in [55].) The 
protocol describes not only the players' moves but also the changes in Skeptic's 
capital $\mathcal{K}_n$; its initial value $\mathcal{K}_0$ can be an arbitrary real number. Protocols of 
this type are studied extensively in [45]. 

For any continuous strategy for Skeptic there exists a strategy for Predictor 
that does not allow Skeptic's capital to grow, regardless of Reality's moves. 
To state this observation in its strongest form, we make Skeptic announce his 
strategy for each round before Predictor's move on that round rather than 
announce his full strategy at the beginning of the game. Therefore, we consider 
the following perfect-information game: 

Forecasting Game II 

Players: Reality, Predictor, Skeptic 

Protocol: 

FOR n = 1, 2, ...: 
  Reality announces $x_n \in X$. 
  Skeptic announces continuous $S_n : \mathbb{R} \to \mathbb{R}$. 
  Predictor announces $\mu_n \in \mathbb{R}$. 
  Reality announces $y_n \in [-Y, Y]$. 
  $\mathcal{K}_n := \mathcal{K}_{n-1} + S_n(\mu_n)(y_n - \mu_n)$. 
END FOR. 

Lemma 2 Predictor has a strategy in Forecasting Game II that ensures $\mu_n \in [-Y, Y]$ 
for all n = 1, 2, ... and $\mathcal{K}_0 \ge \mathcal{K}_1 \ge \mathcal{K}_2 \ge \cdots$. 

Proof Predictor's goal is achieved by the following strategy: 

• if the function $S_n$ takes the value 0 on the interval [-Y, Y], choose $\mu_n \in [-Y, Y]$ 
such that $S_n(\mu_n) = 0$; 

• if $S_n$ is always positive on [-Y, Y], take $\mu_n := Y$; 

• if $S_n$ is always negative on [-Y, Y], take $\mu_n := -Y$. 
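
Predictor's move in Lemma 2 can be sketched in code as follows. This is a hedged illustration rather than the paper's exact rule: it checks the endpoints first and only then looks for a root by bisection, which is a slight variant of the three cases above but still guarantees $S_n(\mu_n)(y_n - \mu_n) \le 0$ for every $y_n \in [-Y, Y]$. The name `defensive_move` and the tolerance are assumptions of this sketch.

```python
from typing import Callable

def defensive_move(S: Callable[[float], float], Y: float, tol: float = 1e-12) -> float:
    """Return mu in [-Y, Y] with S(mu) * (y - mu) <= 0 (up to tol) for all y in [-Y, Y]."""
    if S(Y) > 0:       # Skeptic buys at price Y; the ticket pays y <= Y, so no gain
        return Y
    if S(-Y) < 0:      # Skeptic sells at price -Y; the ticket pays y >= -Y, so no gain
        return -Y
    # Now S(-Y) >= 0 >= S(Y): a continuous S has a root in between; bisect for it.
    lo, hi = -Y, Y
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if S(mid) >= 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(defensive_move(lambda mu: 0.3 - mu, 1.0))   # a root of S
print(defensive_move(lambda mu: 2.0 + mu, 1.0))   # S positive everywhere
print(defensive_move(lambda mu: -5.0, 1.0))       # S negative everywhere
```

Continuity of $S_n$ is what makes the bisection step legitimate: the invariant $S(lo) \ge 0 \ge S(hi)$ pins down a sign change, hence a root.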
Algorithm of Large Numbers 

We say that a kernel $\mathbf{k}$ on $[-Y, Y] \times X$ is forecast-continuous if the function 
$\mathbf{k}((\mu, x), (\mu', x'))$ is continuous in $(\mu, \mu') \in [-Y, Y]^2$, for all fixed $(x, x') \in X^2$. 
For such a kernel the function 

$$S_n(\mu) := \sum_{i=1}^{n-1} \mathbf{k}\bigl((\mu, x_n), (\mu_i, x_i)\bigr)(y_i - \mu_i) - \mathbf{k}\bigl((\mu, x_n), (\mu, x_n)\bigr)\mu \qquad (19)$$

is continuous in $\mu \in [-Y, Y]$. 

The algorithm of large numbers (ALN) 
Parameter: forecast-continuous kernel $\mathbf{k}$ on $[-Y, Y] \times X$ 
FOR n = 1, 2, ...: 
  Read $x_n \in X$. 
  Define $S_n : [-Y, Y] \to \mathbb{R}$ by (19). 
  Output any root $\mu \in [-Y, Y]$ of $S_n(\mu) = 0$ as $\mu_n$; 
    if there are no roots, set $\mu_n := Y \,\mathrm{sign}\, S_n$. 
  Read $y_n \in [-Y, Y]$. 
END FOR. 

(Notice that $\mathrm{sign}\, S_n$ is well defined in this context.) It is well known that for 
each kernel $\mathbf{k}$ on $[-Y, Y] \times X$ there exists a function $\Phi : [-Y, Y] \times X \to \mathcal{H}$ (a 
feature mapping taking values in a Hilbert space $\mathcal{H}$) such that 

$$\mathbf{k}(a, b) = \langle \Phi(a), \Phi(b) \rangle_{\mathcal{H}}, \quad \forall a, b \in [-Y, Y] \times X. \qquad (20)$$

(For example, we can take the RKHS on $[-Y, Y] \times X$ with reproducing kernel 
$\mathbf{k}$ as $\mathcal{H}$ and take $a \mapsto \mathbf{k}_a$ as the feature mapping $\Phi$; there are, however, easier 
and more transparent constructions.) It can be shown that $\Phi(\mu, x)$ is 
forecast-continuous, i.e., continuous in $\mu \in [-Y, Y]$ for each fixed $x \in X$, if and only if 
the kernel $\mathbf{k}$ defined by (20) is forecast-continuous (see, e.g., [SEIj Appendix B, 
where [0, 1] should be replaced with [-Y, Y]). 

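Theorem 5 Let $\mathbf{k}$ be the kernel defined by (20) for a forecast-continuous feature 
mapping $\Phi : [-Y, Y] \times X \to \mathcal{H}$, where $\mathcal{H}$ is a Hilbert space. The ALN with 
parameter $\mathbf{k}$ outputs $\mu_n \in [-Y, Y]$ such that 

$$\left\| \sum_{n=1}^N (y_n - \mu_n) \Phi(\mu_n, x_n) \right\|^2_{\mathcal{H}} \le \sum_{n=1}^N (Y^2 - \mu_n^2) \left\| \Phi(\mu_n, x_n) \right\|^2_{\mathcal{H}} \qquad (21)$$

always holds for all N = 1, 2, .... 

Proof Following the ALN, Predictor ensures that Skeptic will never increase 
his capital with the strategy 

$$s_n := \sum_{i=1}^{n-1} \mathbf{k}\bigl((\mu_n, x_n), (\mu_i, x_i)\bigr)(y_i - \mu_i) - \mathbf{k}\bigl((\mu_n, x_n), (\mu_n, x_n)\bigr)\mu_n. \qquad (22)$$

Using the inequalities 

$$y_n^2 \le Y^2$$

and 

$$\mathbf{k}\bigl((\mu_n, x_n), (\mu_n, x_n)\bigr) \ge 0,$$

we can see that the increase in Skeptic's capital when he follows (22) is 

$$\begin{aligned}
\mathcal{K}_N - \mathcal{K}_0 &= \sum_{n=1}^N s_n (y_n - \mu_n) \\
&= \sum_{n=1}^N \sum_{i=1}^{n-1} \mathbf{k}\bigl((\mu_n, x_n), (\mu_i, x_i)\bigr)(y_n - \mu_n)(y_i - \mu_i)
 - \sum_{n=1}^N \mathbf{k}\bigl((\mu_n, x_n), (\mu_n, x_n)\bigr)\mu_n (y_n - \mu_n) \\
&= \frac12 \sum_{n=1}^N \sum_{i=1}^{N} \mathbf{k}\bigl((\mu_n, x_n), (\mu_i, x_i)\bigr)(y_n - \mu_n)(y_i - \mu_i)
 - \frac12 \sum_{n=1}^N \mathbf{k}\bigl((\mu_n, x_n), (\mu_n, x_n)\bigr)(y_n - \mu_n)^2 \\
&\quad - \sum_{n=1}^N \mathbf{k}\bigl((\mu_n, x_n), (\mu_n, x_n)\bigr)\mu_n (y_n - \mu_n) \\
&= \frac12 \sum_{n=1}^N \sum_{i=1}^{N} \mathbf{k}\bigl((\mu_n, x_n), (\mu_i, x_i)\bigr)(y_n - \mu_n)(y_i - \mu_i)
 - \frac12 \sum_{n=1}^N \mathbf{k}\bigl((\mu_n, x_n), (\mu_n, x_n)\bigr)(y_n^2 - \mu_n^2) \\
&\ge \frac12 \left\| \sum_{n=1}^N (y_n - \mu_n) \Phi(\mu_n, x_n) \right\|^2_{\mathcal{H}}
 - \frac12 \sum_{n=1}^N (Y^2 - \mu_n^2) \left\| \Phi(\mu_n, x_n) \right\|^2_{\mathcal{H}},
\end{aligned}$$

which immediately implies (21). 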
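
The ALN and the bound (21) can be checked on synthetic data. The sketch below makes several assumptions that are not in the paper: the kernel $\mu\mu' + e^{-(x-x')^2}$ is just one convenient forecast-continuous kernel, the labels are random, and the root finder checks the endpoints before bisecting (a variant that still prevents Skeptic's capital from growing). The left-hand side of (21) is evaluated through the Gram matrix, using (20).

```python
import math
import random

Y = 1.0

def kernel(a, b):
    """Illustrative forecast-continuous kernel on [-Y, Y] x R (an assumption of this sketch)."""
    (mu, x), (mu2, x2) = a, b
    return mu * mu2 + math.exp(-(x - x2) ** 2)

def find_forecast(S, Y, tol=1e-13):
    """Endpoint if S has the matching sign there; otherwise a root of S found by bisection."""
    if S(Y) > 0:
        return Y
    if S(-Y) < 0:
        return -Y
    lo, hi = -Y, Y
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if S(mid) >= 0 else (lo, mid)
    return 0.5 * (lo + hi)

def aln(xs, ys, Y):
    """Run the ALN with S_n from (19); label ys[n] is used only after forecast n is made."""
    mus = []
    for n, x in enumerate(xs):
        def S(mu, n=n, x=x):
            past = sum(kernel((mu, x), (mus[i], xs[i])) * (ys[i] - mus[i]) for i in range(n))
            return past - kernel((mu, x), (mu, x)) * mu
        mus.append(find_forecast(S, Y))
    return mus

rng = random.Random(0)
N = 40
xs = [rng.uniform(-2, 2) for _ in range(N)]
ys = [rng.uniform(-Y, Y) for _ in range(N)]
mus = aln(xs, ys, Y)

pts = list(zip(mus, xs))
res = [y - mu for y, mu in zip(ys, mus)]
# ||sum_n r_n Phi_n||^2 equals sum_{n,i} r_n r_i k(point_n, point_i), by (20).
lhs = sum(res[n] * res[i] * kernel(pts[n], pts[i]) for n in range(N) for i in range(N))
rhs = sum((Y * Y - mu * mu) * kernel(p, p) for mu, p in zip(mus, pts))
print(lhs, rhs)
```

Because the roots are only located up to the bisection tolerance, the comparison of the two printed values should allow for a small numerical slack.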
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Resolution 



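This subsection makes the next step in our proof of Theorem 1. Our goal is to 
prove the following result (although we will need a slight modification of this 
result rather than the result itself). 

Theorem 6 Let $\mathcal{F}$ be an RKHS on X with reproducing kernel $\mathbf{k}$. The forecasts 
$\mu_n \in [-Y, Y]$ output by the ALN with parameter $\mathbf{k}$ always satisfy 

$$\left| \sum_{n=1}^N (y_n - \mu_n) D(x_n) \right| \le Y c_{\mathcal F} \|D\|_{\mathcal F} \sqrt{N}$$

for all N and all functions $D \in \mathcal{F}$. 

Proof Using (21) with $\Phi$ being the feature mapping $x \in X \mapsto \mathbf{k}_x \in \mathcal{F}$, we 
obtain 

$$\begin{aligned}
\left| \sum_{n=1}^N (y_n - \mu_n) D(x_n) \right|
&= \left| \left\langle \sum_{n=1}^N (y_n - \mu_n) \mathbf{k}_{x_n}, D \right\rangle_{\mathcal F} \right|
\le \left\| \sum_{n=1}^N (y_n - \mu_n) \mathbf{k}_{x_n} \right\|_{\mathcal F} \|D\|_{\mathcal F} \\
&\le \|D\|_{\mathcal F} \sqrt{\sum_{n=1}^N Y^2 \mathbf{k}(x_n, x_n)}
\le Y c_{\mathcal F} \|D\|_{\mathcal F} \sqrt{N} \qquad (23)
\end{aligned}$$

for any $D \in \mathcal{F}$. 

Theorem 6 can be interpreted as asserting that the ALN has a good 
"resolution" when $\mathcal{F}$ is a universal RKHS; for details, see [56]. 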



Mixing feature mappings 

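In the proof of Theorem 1 we will mix the feature mapping $\Phi_0(\mu, x) := \mu$ (into 
$\mathcal{H}_0 := \mathbb{R}$) and the feature mapping $\Phi_1(\mu, x) := \mathbf{k}_x$ used in the proof of Theorem 
6 (we will have to achieve two goals simultaneously). This can be done using 
the following corollary of Theorem 5. 

Corollary 3 Let $\Phi_j : [-Y, Y] \times X \to \mathcal{H}_j$, j = 0, 1, be forecast-continuous 
mappings from $[-Y, Y] \times X$ to Hilbert spaces $\mathcal{H}_j$, and let $a_0, a_1$ be two positive 
constants. The forecasts $\mu_n \in [-Y, Y]$ output by the ALN with a suitable kernel 
parameter always satisfy 

$$\left\| \sum_{n=1}^N (y_n - \mu_n) \Phi_j(\mu_n, x_n) \right\|^2_{\mathcal{H}_j}
\le \frac{Y^2}{a_j} \sum_{n=1}^N \left( a_0 \left\| \Phi_0(\mu_n, x_n) \right\|^2_{\mathcal{H}_0} + a_1 \left\| \Phi_1(\mu_n, x_n) \right\|^2_{\mathcal{H}_1} \right)$$

for all N and for both j = 0 and j = 1. 

Proof Define the "weighted direct sum" $\mathcal{H}$ of $\mathcal{H}_0$ and $\mathcal{H}_1$ as the Cartesian 
product $\mathcal{H}_0 \times \mathcal{H}_1$ equipped with the inner product 

$$\langle g, g' \rangle_{\mathcal{H}} = \bigl\langle (g_0, g_1), (g'_0, g'_1) \bigr\rangle_{\mathcal{H}} := \sum_{j=0}^{1} a_j \langle g_j, g'_j \rangle_{\mathcal{H}_j}.$$

Now we can define $\Phi : [-Y, Y] \times X \to \mathcal{H}$ by 

$$\Phi(\mu, x) := \bigl( \Phi_0(\mu, x), \Phi_1(\mu, x) \bigr);$$

the corresponding kernel is 

$$\mathbf{k}\bigl((\mu, x), (\mu', x')\bigr) := \bigl\langle \Phi(\mu, x), \Phi(\mu', x') \bigr\rangle_{\mathcal{H}}
= \sum_{j=0}^{1} a_j \bigl\langle \Phi_j(\mu, x), \Phi_j(\mu', x') \bigr\rangle_{\mathcal{H}_j}
= \sum_{j=0}^{1} a_j \mathbf{k}_j\bigl((\mu, x), (\mu', x')\bigr),$$

where $\mathbf{k}_0$ and $\mathbf{k}_1$ are the kernels corresponding to $\Phi_0$ and $\Phi_1$, respectively. It 
is clear that this kernel is forecast-continuous. Applying the ALN to it and 
using (21), we obtain 

$$\begin{aligned}
a_j \left\| \sum_{n=1}^N (y_n - \mu_n) \Phi_j(\mu_n, x_n) \right\|^2_{\mathcal{H}_j}
&\le \left\| \sum_{n=1}^N (y_n - \mu_n) \Phi(\mu_n, x_n) \right\|^2_{\mathcal{H}} \\
&\le Y^2 \sum_{n=1}^N \left\| \Phi(\mu_n, x_n) \right\|^2_{\mathcal{H}}
= Y^2 \sum_{n=1}^N \sum_{j=0}^{1} a_j \left\| \Phi_j(\mu_n, x_n) \right\|^2_{\mathcal{H}_j}.
\end{aligned}$$

Merging $\Phi_0(\mu, x) = \mu$ and $\Phi_1(\mu, x) = \mathbf{k}_x$ by Corollary 3, we obtain 

$$\left| \sum_{n=1}^N (y_n - \mu_n) \mu_n \right|
= \left| \sum_{n=1}^N (y_n - \mu_n) \Phi_0(\mu_n, x_n) \right|
\le \frac{Y}{\sqrt{a_0}} \sqrt{\sum_{n=1}^N \bigl( a_0 \mu_n^2 + a_1 \mathbf{k}(x_n, x_n) \bigr)} \qquad (24)$$

and, using (23), 

$$\left| \sum_{n=1}^N (y_n - \mu_n) D(x_n) \right|
\le \left\| \sum_{n=1}^N (y_n - \mu_n) \Phi_1(\mu_n, x_n) \right\|_{\mathcal F} \|D\|_{\mathcal F}
\le \frac{Y}{\sqrt{a_1}} \|D\|_{\mathcal F} \sqrt{\sum_{n=1}^N \bigl( a_0 \mu_n^2 + a_1 \mathbf{k}(x_n, x_n) \bigr)} \qquad (25)$$

for each function $D \in \mathcal{F}$. 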
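
The weighted direct sum used in the proof of Corollary 3 can be illustrated with explicit finite-dimensional features. The sketch below is only an illustration under assumed components: $\Phi_0(\mu, x) = \mu$ as in the text, and a hypothetical two-dimensional $\Phi_1(\mu, x) = (\cos x, \sin x)$ standing in for $\mathbf{k}_x$, so that $\mathbf{k}_1 = \cos(x - x')$. It checks that the merged kernel $a_0 \mathbf{k}_0 + a_1 \mathbf{k}_1$ equals the weighted inner product of the stacked features.

```python
import math

def merged_kernel(k0, k1, a0: float, a1: float):
    """Kernel of the weighted direct sum: a0 * k0 + a1 * k1."""
    return lambda p, q: a0 * k0(p, q) + a1 * k1(p, q)

# Points are pairs (mu, x). The component kernels below are illustrative assumptions.
k0 = lambda p, q: p[0] * q[0]                 # from Phi_0(mu, x) = mu
k1 = lambda p, q: math.cos(p[1] - q[1])       # from Phi_1(mu, x) = (cos x, sin x)

Y = 2.0
a0, a1 = 1 / Y**2, 1.0
k = merged_kernel(k0, k1, a0, a1)

def phi(p):
    """Stacked feature vector in H_0 x H_1 = R x R^2."""
    return (p[0], math.cos(p[1]), math.sin(p[1]))

def weighted_inner(u, v, a0: float, a1: float) -> float:
    """Inner product of the weighted direct sum: a0 <u_0, v_0> + a1 <u_1, v_1>."""
    return a0 * u[0] * v[0] + a1 * (u[1] * v[1] + u[2] * v[2])

p, q = (0.5, 1.0), (-1.5, 0.2)
print(k(p, q), weighted_inner(phi(p), phi(q), a0, a1))
```

The weights $a_0$ and $a_1$ control how the single ALN guarantee is split between the two goals, which is why they reappear as $1/\sqrt{a_0}$ and $1/\sqrt{a_1}$ in (24) and (25).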



Proof proper 

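The proof is based on the elementary inequality 

$$\begin{aligned}
\sum_{n=1}^N (y_n - \mu_n)^2
&= \sum_{n=1}^N (y_n - D(x_n))^2 + 2 \sum_{n=1}^N (D(x_n) - \mu_n)(y_n - \mu_n) - \sum_{n=1}^N (D(x_n) - \mu_n)^2 \\
&\le \sum_{n=1}^N (y_n - D(x_n))^2 + 2 \sum_{n=1}^N (D(x_n) - \mu_n)(y_n - \mu_n) \qquad (26)
\end{aligned}$$

(the intermediate equality follows from $a^2 = (a - b)^2 + 2ab - b^2$ with $a := y_n - \mu_n$ 
and $b := D(x_n) - \mu_n$). Using this 
inequality and (24)-(25), we obtain for the $\mu_n \in [-Y, Y]$ output by the ALN 
with the merged kernel as parameter: 

$$\begin{aligned}
\sum_{n=1}^N (y_n - \mu_n)^2
&\le \sum_{n=1}^N (y_n - D(x_n))^2 + 2 \left| \sum_{n=1}^N D(x_n)(y_n - \mu_n) \right| + 2 \left| \sum_{n=1}^N \mu_n (y_n - \mu_n) \right| \\
&\le \sum_{n=1}^N (y_n - D(x_n))^2 + 2Y \left( \frac{\|D\|_{\mathcal F}}{\sqrt{a_1}} + \frac{1}{\sqrt{a_0}} \right) \sqrt{\sum_{n=1}^N \bigl( a_0 \mu_n^2 + a_1 \mathbf{k}(x_n, x_n) \bigr)} \\
&\le \sum_{n=1}^N (y_n - D(x_n))^2 + 2Y \left( \frac{\|D\|_{\mathcal F}}{\sqrt{a_1}} + \frac{1}{\sqrt{a_0}} \right) \sqrt{a_0 Y^2 + a_1 c_{\mathcal F}^2}\, \sqrt{N}.
\end{aligned}$$

It remains to set $a_1 := 1$ and $a_0 := 1/Y^2$, which turns the last regret term into 
$2Y \sqrt{c_{\mathcal F}^2 + 1}\, (\|D\|_{\mathcal F} + Y) \sqrt{N}$. 

7 Proof of Theorem 2 

In this section we will modify (essentially, further simplify) the proof of Theorem 
1 given in the previous section to obtain the proof of Theorem 2. 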

K29 algorithm 

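A kernel $\mathbf{k}$ on $[-Y, Y] \times X$ is K29-admissible if the function $\mathbf{k}((\mu, x), (\mu', x'))$ is 
continuous in $\mu \in [-Y, Y]$ for all fixed $\mu' \in [-Y, Y]$, $x \in X$, and $x' \in X$. For 
such a kernel the function 

$$S_n(\mu) := \sum_{i=1}^{n-1} \mathbf{k}\bigl((\mu, x_n), (\mu_i, x_i)\bigr)(y_i - \mu_i) \qquad (27)$$

is continuous in $\mu \in [-Y, Y]$. 

The K29 algorithm 
Parameter: K29-admissible kernel $\mathbf{k}$ on $[-Y, Y] \times X$ 
FOR n = 1, 2, ...: 
  Read $x_n \in X$. 
  Define $S_n : [-Y, Y] \to \mathbb{R}$ by (27). 
  Output any root $\mu \in [-Y, Y]$ of $S_n(\mu) = 0$ as $\mu_n$; 
    if there are no roots, set $\mu_n := Y \,\mathrm{sign}\, S_n$. 
  Read $y_n \in [-Y, Y]$. 
END FOR. 

Let us say that a feature mapping $\Phi(\mu, x)$ is K29-admissible if the kernel $\mathbf{k}$ 
defined by (20) is K29-admissible. 

Theorem 7 Let $\mathbf{k}$ be the kernel defined by (20) for a K29-admissible feature 
mapping $\Phi : [-Y, Y] \times X \to \mathcal{H}$. The K29 algorithm with parameter $\mathbf{k}$ outputs 
$\mu_n \in [-Y, Y]$ such that 

$$\left\| \sum_{n=1}^N (y_n - \mu_n) \Phi(\mu_n, x_n) \right\|^2_{\mathcal{H}} \le \sum_{n=1}^N (y_n - \mu_n)^2 \left\| \Phi(\mu_n, x_n) \right\|^2_{\mathcal{H}} \qquad (28)$$

always holds for all N = 1, 2, .... 

Proof Following the K29 algorithm Predictor ensures that Skeptic will never 
increase his capital with the strategy 

$$s_n := \sum_{i=1}^{n-1} \mathbf{k}\bigl((\mu_n, x_n), (\mu_i, x_i)\bigr)(y_i - \mu_i),$$

which implies 

$$\begin{aligned}
0 \ge \mathcal{K}_N - \mathcal{K}_0 &= \sum_{n=1}^N s_n (y_n - \mu_n) \\
&= \sum_{n=1}^N \sum_{i=1}^{n-1} \mathbf{k}\bigl((\mu_n, x_n), (\mu_i, x_i)\bigr)(y_n - \mu_n)(y_i - \mu_i) \\
&= \frac12 \sum_{n=1}^N \sum_{i=1}^{N} \mathbf{k}\bigl((\mu_n, x_n), (\mu_i, x_i)\bigr)(y_n - \mu_n)(y_i - \mu_i)
 - \frac12 \sum_{n=1}^N \mathbf{k}\bigl((\mu_n, x_n), (\mu_n, x_n)\bigr)(y_n - \mu_n)^2 \\
&= \frac12 \left\| \sum_{n=1}^N (y_n - \mu_n) \Phi(\mu_n, x_n) \right\|^2_{\mathcal{H}}
 - \frac12 \sum_{n=1}^N (y_n - \mu_n)^2 \left\| \Phi(\mu_n, x_n) \right\|^2_{\mathcal{H}},
\end{aligned}$$

which in turn implies (28). 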

Mixing feature mappings 

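Now we have the following corollary of Theorem 7. 

Corollary 4 Let $\Phi_j : [-Y, Y] \times X \to \mathcal{H}_j$, j = 0, 1, be forecast-continuous 
mappings from $[-Y, Y] \times X$ to Hilbert spaces $\mathcal{H}_j$, and let $a_j$, j = 0, 1, be positive 
constants. The forecasts $\mu_n \in [-Y, Y]$ output by the K29 algorithm with a 
suitable kernel parameter always satisfy 

$$\left\| \sum_{n=1}^N (y_n - \mu_n) \Phi_j(\mu_n, x_n) \right\|^2_{\mathcal{H}_j}
\le \frac{1}{a_j} \sum_{n=1}^N (y_n - \mu_n)^2 \left( a_0 \left\| \Phi_0(\mu_n, x_n) \right\|^2_{\mathcal{H}_0} + a_1 \left\| \Phi_1(\mu_n, x_n) \right\|^2_{\mathcal{H}_1} \right)$$

for all N and for both j = 0 and j = 1. 

Proof Being forecast-continuous, the kernel $\mathbf{k}$ defined in the proof of Corollary 
3 is a fortiori K29-admissible. Applying the K29 algorithm to it and using 
(28), we obtain 

$$\begin{aligned}
a_j \left\| \sum_{n=1}^N (y_n - \mu_n) \Phi_j(\mu_n, x_n) \right\|^2_{\mathcal{H}_j}
&\le \left\| \sum_{n=1}^N (y_n - \mu_n) \Phi(\mu_n, x_n) \right\|^2_{\mathcal{H}} \\
&\le \sum_{n=1}^N (y_n - \mu_n)^2 \left\| \Phi(\mu_n, x_n) \right\|^2_{\mathcal{H}}
= \sum_{n=1}^N (y_n - \mu_n)^2 \sum_{j=0}^{1} a_j \left\| \Phi_j(\mu_n, x_n) \right\|^2_{\mathcal{H}_j}.
\end{aligned}$$

Merging $\Phi_0(\mu, x) = \mu$ and $\Phi_1(\mu, x) = \mathbf{k}_x$ by Corollary 4, we obtain 

$$\begin{aligned}
\left| \sum_{n=1}^N (y_n - \mu_n) \mu_n \right|
= \left| \sum_{n=1}^N (y_n - \mu_n) \Phi_0(\mu_n, x_n) \right|
&\le \frac{1}{\sqrt{a_0}} \sqrt{\sum_{n=1}^N (y_n - \mu_n)^2 \bigl( a_0 \mu_n^2 + a_1 \mathbf{k}(x_n, x_n) \bigr)} \\
&\le \frac{1}{\sqrt{a_0}} \sqrt{a_1 c_{\mathcal F}^2 + a_0 Y^2}\, \sqrt{\sum_{n=1}^N (y_n - \mu_n)^2} \qquad (29)
\end{aligned}$$

and, using (23), 

$$\begin{aligned}
\left| \sum_{n=1}^N (y_n - \mu_n) D(x_n) \right|
&\le \left\| \sum_{n=1}^N (y_n - \mu_n) \Phi_1(\mu_n, x_n) \right\|_{\mathcal F} \|D\|_{\mathcal F} \\
&\le \frac{1}{\sqrt{a_1}} \|D\|_{\mathcal F} \sqrt{a_1 c_{\mathcal F}^2 + a_0 Y^2}\, \sqrt{\sum_{n=1}^N (y_n - \mu_n)^2} \qquad (30)
\end{aligned}$$

for each function $D \in \mathcal{F}$. 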



Proof proper 



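Using (26) and (29)-(30) with $a_0 := a$ and $a_1 := 1$, we obtain for the $\mu_n$ output 
by the K29 algorithm with the merged kernel as parameter: 

$$\begin{aligned}
\sum_{n=1}^N (y_n - \mu_n)^2
&\le \sum_{n=1}^N (y_n - D(x_n))^2 + 2 \left| \sum_{n=1}^N \mu_n (y_n - \mu_n) \right| + 2 \left| \sum_{n=1}^N D(x_n)(y_n - \mu_n) \right| \\
&\le \sum_{n=1}^N (y_n - D(x_n))^2 + 2 \sqrt{c_{\mathcal F}^2 + a Y^2} \left( \|D\|_{\mathcal F} + \frac{1}{\sqrt{a}} \right) \sqrt{\sum_{n=1}^N (y_n - \mu_n)^2}.
\end{aligned}$$

The inequality between the extreme terms of this chain is quadratic in 
$\sqrt{\sum_{n=1}^N (y_n - \mu_n)^2}$; solving it, we obtain 

$$\sqrt{\sum_{n=1}^N (y_n - \mu_n)^2}
\le \sqrt{\sum_{n=1}^N (y_n - D(x_n))^2 + \bigl( c_{\mathcal F}^2 + a Y^2 \bigr) \left( \|D\|_{\mathcal F} + \frac{1}{\sqrt{a}} \right)^2}
+ \sqrt{c_{\mathcal F}^2 + a Y^2} \left( \|D\|_{\mathcal F} + \frac{1}{\sqrt{a}} \right),$$

which is equivalent to the bound of Theorem 2 when $a = 1/Y^2$. 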
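
The step "solving it" amounts to the following elementary fact: if $x \ge 0$ satisfies $x^2 \le A + 2Bx$ with $A, B \ge 0$, then $x \le B + \sqrt{A + B^2}$. A small sketch of the resulting bound is given below; the function names and the numbers in the usage example are illustrative assumptions, and $B$ is taken from the last display above.

```python
import math

def largest_root_bound(A: float, B: float) -> float:
    """Largest x >= 0 with x**2 <= A + 2*B*x, namely B + sqrt(A + B**2)."""
    return B + math.sqrt(A + B * B)

def squared_loss_bound(loss_D: float, norm_D: float, c_F: float, Y: float, a: float) -> float:
    """Bound on sum (y_n - mu_n)^2 obtained by squaring the solved inequality."""
    B = math.sqrt(c_F**2 + a * Y**2) * (norm_D + 1 / math.sqrt(a))
    return largest_root_bound(loss_D, B) ** 2

Y, c_F, norm_D, loss_D = 1.0, 1.0, 2.0, 50.0   # hypothetical values
print(squared_loss_bound(loss_D, norm_D, c_F, Y, a=1 / Y**2))
```

Expanding the square shows that the excess over `loss_D` is of order $B\sqrt{\text{loss\_D}} + B^2$, so the regret scales with the square root of the benchmark's own loss rather than with $\sqrt{N}$ directly.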

8 Bayes-type competitive on-line regression and 
proof of Theorem 3 

The first result in the Bayes-style competitive on-line regression appears to be 
the following: if the benchmark class $\mathcal{F}$ consists of the linear functions $D(x) = 
\langle \theta, x \rangle$ on $X = \mathbb{R}^m$ whose "complexity" is measured by the norm $\|\theta\|_2 := 
\sqrt{\sum_{i=1}^m \theta_i^2}$ of the vector $\theta$ with components $\theta_i$ and if a is a positive constant, some on-line 
prediction algorithm (namely, the Aggregating Algorithm) ensures 

$$\begin{aligned}
\sum_{n=1}^N (y_n - \mu_n)^2
&\le \sum_{n=1}^N \bigl( y_n - \langle \theta, x_n \rangle \bigr)^2 + a \|\theta\|_2^2 + Y^2 \ln\det \left( I + \frac{1}{a} \sum_{n=1}^N x_n x_n' \right) \qquad (31) \\
&\le \sum_{n=1}^N \bigl( y_n - \langle \theta, x_n \rangle \bigr)^2 + a \|\theta\|_2^2 + Y^2 \sum_{i=1}^m \ln \left( 1 + \frac{1}{a} \sum_{n=1}^N x_{n,i}^2 \right),
\end{aligned}$$

for all N and all $\theta \in \mathbb{R}^m$ ([SJ, Theorem 1; different proofs are given in [7], 
Theorem 4.6, and [23]). In particular, if $\|\theta\|_2$ and all components $x_{n,i}$ of all $x_n$ 
are bounded by a constant, 

$$\sum_{n=1}^N (y_n - \mu_n)^2 \le \sum_{n=1}^N \bigl( y_n - \langle \theta, x_n \rangle \bigr)^2 + O(\ln N);$$

it is interesting that the regret term is now $O(\ln N)$, rather than $O(\sqrt{N})$ as in 

©■ 

We are, however, interested in the infinite-dimensional benchmark classes. 
The result (31) was carried over to separable RKHS: there is an on-line 
prediction algorithm that ensures 

$$\sum_{n=1}^N (y_n - \mu_n)^2 \le \sum_{n=1}^N (y_n - D(x_n))^2 + a \|D\|_{\mathcal F}^2 + Y^2 \ln\det \left( I + \frac{1}{a} K \right) \qquad (32)$$

for all N and all prediction rules D in a separable RKHS $\mathcal{F}$ on X, where K is 
the $N \times N$ Gram matrix with the elements $K_{i,j} := \mathbf{k}(x_i, x_j)$, i, j = 1, ..., N, 
and $\mathbf{k}$ is $\mathcal{F}$'s reproducing kernel. (Actually this result is stated only for 
prediction rules D of the form $\sum_{i=1}^{k} c_i \mathbf{k}_{z_i}$, where $k \in \{1, 2, \ldots\}$, $c_1, \ldots, c_k \in \mathbb{R}$, 
and $z_1, \ldots, z_k \in X$; but the result is true in general since such prediction rules 
are dense in $\mathcal{F}$: see @], §1.2. (4). Alternatively, the general result follows by the 
representer theorem, stated in, e.g., and Theorem 4.2 on p. 90.) 
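A disadvantage of the bound (32) is that, for a fixed a, the term 

$$\ln\det \left( I + \frac{1}{a} K \right)$$

(which also occurs in Theorems 3.1 and 3.2, and ^21) can have order of 
magnitude N: indeed, if 

$$\mathbf{k}(x_i, x_j) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise,} \end{cases}$$

this term becomes 

$$\ln \prod_{n=1}^N \left( 1 + \frac{1}{a} \right) = N \ln(1 + 1/a).$$

More generally, Minkowski's result from [5], Chapter 2, Theorem 15, shows that 

$$\ln\det \left( I + \frac{1}{a} K \right) \ge N \ln \left( 1 + \left( \det \left( \frac{1}{a} K \right) \right)^{1/N} \right)$$

and so this term will not be small as compared to N unless $\det K \le (a\epsilon)^N$ for 
a small $\epsilon > 0$. 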
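
The behaviour of the term $\ln\det(I + K/a)$ can be explored numerically. The sketch below uses only the standard library (a small Cholesky factorization written by hand) and checks three things: the identity-Gram case $N \ln(1 + 1/a)$, the Minkowski lower bound, and the upper bound by the product of diagonal elements. The Gaussian kernel and the sample points are assumptions chosen for illustration.

```python
import math

def log_det_spd(M):
    """ln det of a symmetric positive definite matrix via a Cholesky factorization."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][p] * L[j][p] for p in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return 2.0 * sum(math.log(L[i][i]) for i in range(n))

def regret_term(K, a):
    """ln det(I + K / a) for a Gram matrix K and a > 0."""
    n = len(K)
    return log_det_spd([[(1.0 if i == j else 0.0) + K[i][j] / a for j in range(n)]
                        for i in range(n)])

a, N = 0.5, 6
identity = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
print(regret_term(identity, a), N * math.log(1 + 1 / a))

# Gram matrix of an illustrative Gaussian kernel on equally spaced points.
xs = [0.5 * i for i in range(N)]
K = [[math.exp(-(x - z) ** 2) for z in xs] for x in xs]
lower = N * math.log(1 + math.exp(log_det_spd(K) / N) / a)   # Minkowski bound
upper = sum(math.log(1 + K[i][i] / a) for i in range(N))     # product-of-diagonal bound
print(lower, regret_term(K, a), upper)
```

For strongly correlated objects the determinant of K is tiny, so the Minkowski lower bound is far below $N \ln(1 + 1/a)$; this is the regime in which the term can be small compared to N.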

Our argument in the previous paragraph assumed that a was fixed. Let us 
now see what (32) leads to when N and an upper bound d on $\|D\|_{\mathcal F}$ are given in 
advance, which gives some scope for optimizing a. In conjunction with the fact 
that the determinant of a positive definite matrix does not exceed the product 
of its diagonal elements ([5], Chapter 2, Theorem 7), (32) implies 

$$\begin{aligned}
\sum_{n=1}^N (y_n - \mu_n)^2
&\le \sum_{n=1}^N (y_n - D(x_n))^2 + a \|D\|_{\mathcal F}^2 + Y^2 N \ln \left( 1 + \frac{c_{\mathcal F}^2}{a} \right) \\
&\le \sum_{n=1}^N (y_n - D(x_n))^2 + a d^2 + \frac{Y^2 c_{\mathcal F}^2 N}{a}. \qquad (33)
\end{aligned}$$

The minimum of $a d^2 + Y^2 c_{\mathcal F}^2 N / a$ is achieved at $a = (Y c_{\mathcal F} / d) \sqrt{N}$, and for this 
value of a (33) becomes 

$$\sum_{n=1}^N (y_n - \mu_n)^2 \le \sum_{n=1}^N (y_n - D(x_n))^2 + 2 Y c_{\mathcal F} d \sqrt{N}.$$
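
The optimization over a is a one-line calculus exercise, and the following sketch simply confirms it numerically. The function names and the sample values of d, Y, $c_{\mathcal F}$ and N are illustrative assumptions.

```python
import math

def regret_for_a(a: float, d: float, Y: float, c_F: float, N: int) -> float:
    """The a-dependent part of the second bound in (33): a d^2 + Y^2 c_F^2 N / a."""
    return a * d**2 + Y**2 * c_F**2 * N / a

def optimal_a(d: float, Y: float, c_F: float, N: int) -> float:
    """Minimizer a = (Y c_F / d) sqrt(N), found by setting the derivative in a to zero."""
    return (Y * c_F / d) * math.sqrt(N)

d, Y, c_F, N = 3.0, 1.0, 1.5, 10_000
a_star = optimal_a(d, Y, c_F, N)
print(a_star, regret_for_a(a_star, d, Y, c_F, N), 2 * Y * c_F * d * math.sqrt(N))
```

At the minimizer the two addends are equal, each contributing $Y c_{\mathcal F} d \sqrt{N}$, which is where the factor 2 in the regret term comes from.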

We can see an analogue of the familiar term $2 Y c_{\mathcal F} \|D\|_{\mathcal F} \sqrt{N}$. The expression 

$$\frac12 Y^2 \ln N + \frac{c_{\mathcal F}^2 \|D\|_{\mathcal F}^2}{4} + O(Y^2)$$

in (|3J) can be interpreted as the price that we pay for not knowing $\|D\|_{\mathcal F}$ and 
N in advance. 

Kummer's U function 

In the proof of Theorem 3 we will need an approximation to Kummer's U 
function 

$$U(a, b, z) := \frac{1}{\Gamma(a)} \int_0^\infty e^{-zt} t^{a-1} (1 + t)^{b-a-1} \, dt \qquad (34)$$

from [20], p. 44, (26a). A concise statement of this result is given in [21], p. 281, 
§6.13.3, (22) (in fact, Taylor states his result in terms of the closely related 
Whittaker function; [21] states it in terms of Kummer's U function, which is, 
however, denoted $\Psi$; cf. (2) on p. 255; in using the notation U we are following 
Chapter 13: cf. 13.2.5 on p. 505). The approximation is given by the formula 

^_,^,/2-i/4(^ _ 4Ai)i/4e«--/2t/(a, b, z) 

[l + 0(\nr)+0(\i\-')), (35) 



where r £ (0, 1], 



6/2- a. 



(zV2 + (^_ 4^)1/2)- 1 

and it is assumed that ^ ^ oo and 

\z\ > 5\k\-^+'''- (36) 

for some constant (5 > 0. We are only interested in the case z > 0, a > 1, and 
b e [0, 1], which also implies k < 0. Since ln(— 1) = ±z7r, the expression for 
can be rewritten as 

^^ = . m + + „ 1 V2(, + 4|,|) V2 ± (37) 

4|k| 2 ^ ' ^ ' 

and we can see that 3?(«^) < and, therefore, arg^ £ (0,7r); we have already 
used this fact in choosing the expression (|35() among the expressions given in 
ini, P- 281, §6.13.3 ((21)-(24)). Using ^ and the fact that |^| > |K|7r, we 
deduce from 

In U(a, b, z) = kIhk + ( llnz ln(2 — 4k) — k H — 

V4 2/ 4 2 

+ i^ + 0(^\K[ 

Kln|K| + Inz - iln(z-4K) - + | 
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1 1/2. . M/. , {7y^ + {z + A\n\f/^f ^/ 
-7y\z - 4^)1/2 + Acln ^ '^1^1 ^ + O {\n 

= Inz In K] In2-K+- 



4 27 4 V4 / 2 2 

I h\ I f h z\ I b z 

--- lnz--ln a-- + - --ln2 + a- - + - 
42/ 4V24y2 22 



+ 0{\k\ 

By Stirling's formula (P, p. 257, 6.1.41), 

Inr(a) = -a + (a - i j Ina + ^ ln(27r) + O (fl-^) , 



which for be [0, 1] gives 



-\n{r{a)U{a,b,z))=[^-^)\nz+\ln(a-^ + ^^ ^ 



z 

2 4 / ' ■ 4 V " 2 ■ 4 / ■ 2 2 



2/ \V 2 4/ \4J 



a— — jlna— — ln7r + 0(a 









^ 6 




b 


z 


a- 


i)lnzH 






















+ 2 


^ 2 



2 4/ V 2/\V 2a 4a/ V4a/ 



+ Q-01na-iln7r + O(a-'-). (38) 
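
The integral representation (34) of Kummer's U function is easy to evaluate numerically for moderate arguments, which gives a way to sanity-check asymptotic formulas such as the ones above. The sketch below is a plain Simpson's-rule quadrature, assuming $a \ge 1$ and $z > 0$; the truncation point and the number of steps are ad hoc choices, not values from the paper. It is tested against the closed form $U(a, a+1, z) = z^{-a}$, which follows directly from (34) because the factor $(1+t)^{b-a-1}$ is then identically 1.

```python
import math

def kummer_u(a: float, b: float, z: float, steps: int = 20000) -> float:
    """Kummer's U via the integral (34), by Simpson's rule; assumes a >= 1 and z > 0."""
    upper = 60.0 / z                      # beyond this point e^{-z t} is at most e^{-60}
    h = upper / steps                     # steps must be even for Simpson's rule

    def f(t: float) -> float:
        return math.exp(-z * t) * t ** (a - 1) * (1 + t) ** (b - a - 1)

    total = f(0.0) + f(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(i * h)
    return total * h / 3 / math.gamma(a)

print(kummer_u(2.0, 3.0, 1.5), 1.5 ** -2.0)
```

For the large values of a that matter in the proof (a is of order N/2) a direct quadrature like this one becomes numerically delicate, which is one reason an asymptotic approximation is used instead.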



Proof of Theorem 3 



To get rid of the parameter a in the first inequality of (33), we will merge the 
AA predictions (truncated to [-Y, Y] if necessary) corresponding to all possible 
$a \in (0, \infty)$ with respect to the probability measure 

on $(0, \infty)$; here and in what follows we let c stand for $c_{\mathcal F}$ and $\epsilon$ for a constant 
in (0, 1] to be chosen later. Taking $\eta := 1/(2Y^2)$ (see towards the end 
of §2.4) and $\beta := e^{-\eta}$, making use of Lemmas 1 and 2 of [SJ, and letting d 
stand for $\|D\|_{\mathcal F}$, we obtain the following bound for the excess loss of the merged 
predictions over D's predictions over the first N rounds: 



/-I f / 2 \ ^ 'I ^ 

^ad=+Y^JVln(l+cVa)Q(^„) = - i In J 6-""'^' (l + Q{da) 

I- / 2\-^/2 

= -2y2 In / e-'^d'/i^y') (1 + ^1 Q{da) 

= -2Y^ In {^ec^^ g-'^VCsr^) ^ ^2yN/2-i-. ^n/2^^ 

Substituting c^t for a transforms this to 



-2y2 In e 



which, by H34(l and H38|l . can be written as 



, , ,'iV \ /TV c2d2 
2^2 In er — + 1 [/ — + 1, 1 - e. 



2 7 V 2 '2^2 
= -2y2lne + rM--e In— -r + — In 



2 J 2^2 ' 2 V 2 2 2 

, C2d2 / c2d2V/2 

+ - e) + Ycd iV + 1 + e 



2 ' V 4^2 y 



2y^(iV + l + e)ln 1 - 



1-e c2d2 \i/2 cd 



N + 2 4y2(Ar + 2); 2yViVT2 



+ tY'^ In + 1^ - y2 In ;r + (TV"'') . (39) 

Remember that the validity of the approximation (35) requires the condition 
(36). In the most interesting case $1 \ll z \ll \kappa$ we can take r := 1/2 to get the best 
bound, corresponding to the accuracy of $O(N^{-1/2})$; unfortunately, such a bound 
would be rather clumsy, and the following transformations (which sometimes 
are inequalities rather than equalities) are performed to a much worse accuracy, 
$O(Y^2)$. To make sure that (36) holds it is now sufficient to assume that 

$$\frac{c^2 d^2}{Y^2} \ge \delta^2 N^{-1+2r} \qquad (40)$$

for some constant $\delta$; we will first assume that this condition holds, and at the 
end of the proof will get rid of it (although not completely: it survives in the 
presence of "max" in ©). 

The fourth addend from the end of (39) can be bounded above as follows: 
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^,.9/,. s ( c^d"^ cd \ 

c^d'^ . 

< + YcdVN + 1 + e. 

4 

This allows us to bound (|39|l from above by 

(1 _ ,„ 2? + 1! ,„ f » + V — + f 'v + 1 + . + V" 

+ + YcdVN + 1 + 6 + 6Y^ \nN + (Y^) 

4 ^ ' 

cd /I \ , c^d"^ 

< (2-2e)r2ln— + i-+e]Y^\iiN + 2Ycdy/N + 1 + e + — + O (F^) . 

If we take $\epsilon = 1$, this will give 

$$\frac12 Y^2 \ln N + 2 Y c d \sqrt{N + 2} + \frac{c^2 d^2}{4} + O(Y^2), \qquad (41)$$

i.e., The choice of e = 1 appears to lead to the simplest regret term, but 
notice that by choosing e close to we can improve the constant in the 
second leading addend in the regret term in ||SJ) making it close to ^Y'^. 

Returning to the condition (|40|l . it is easy to check that the bound (|41|l will 
remain valid without this condition if cd is replaced by 

P max (cd, YSN-^/^+^'j ; 

indeed, this immediately follows from the monotonicity of T(a)U{a,b, z) in z. 
It remains to notice that the difference between the addend (crf)2/4 in (|41|l and 
p2/4 can be accommodated in the 0{Y^) term, so this addend can be left as it 
is. 



9 Proof of Theorem 4 

Let $\mathcal{F}$ be the RKHS corresponding to the kernel on $X = \mathbb{R}$ defined by $\mathbf{k}(x, x') := 
c^2 h(x - x')$, where $h : \mathbb{R} \to \mathbb{R}$ is the "triangular" function $h(t) := \max(1 - |t|, 0)$; 
the positive definiteness of $\mathbf{k}$ follows from Bochner's theorem (see, e.g., [22], 
§XIX.2) and Polya's theorem ([22], Example (b) in §XV.3). Representation 
P7|l shows that $c_{\mathcal F} = c$. 

Reality's strategy is $x_n := 2n$ and $y_n := \pm Y$, with $\mathrm{sign}(y_n)$ opposite to 
$\mathrm{sign}(\mu_n)$ (when $\mu_n = 0$, $\mathrm{sign}(y_n)$ is chosen arbitrarily). This will make sure that 
the loss of the on-line prediction algorithm over the first N rounds is at least 
$Y^2 N$. 

Let $f_n \in \mathcal{F}$ be the function defined by $f_n(x) := c\, h(x - 2n)$, n = 1, 2, .... It 
is clear that the functions $f_n$, n = 1, 2, ..., are orthogonal and $\|f_n\|_{\mathcal F} = 1$. Set 
$a := cd / (Y \sqrt{N})$ and let the decision rule $D \in \mathcal{F}$ be defined by 

$$D := a \sum_{n=1}^N \frac{y_n}{c} f_n;$$

one of the conditions of the theorem ensures that $a \le 1$ and, therefore, D takes 
values in [-Y, Y]. The loss of D over the first N rounds is $(1 - a)^2 Y^2 N$ and 
the norm of D is $\|D\|_{\mathcal F} = a (Y/c) \sqrt{N} = d$. We can see that the excess loss of 
the prediction algorithm as compared to D is at least 

$$Y^2 N - (1 - a)^2 Y^2 N = (2a - a^2) Y^2 N = (2 - a) Y c d \sqrt{N},$$

which completes the proof. 
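
Reality's strategy from this proof can be simulated against any on-line predictor. In the sketch below the opponent `running_mean` is a hypothetical predictor chosen only for illustration, and the numerical values of Y, c, d and N are assumptions picked so that $a \le 1$. The script builds the benchmark rule D from the labels Reality actually chose and compares the excess loss with $(2 - a) Y c d \sqrt{N}$.

```python
import math

def h(t: float) -> float:
    """Triangular function max(1 - |t|, 0)."""
    return max(1.0 - abs(t), 0.0)

def run_adversary(predict, N: int, Y: float):
    """Reality plays x_n = 2n and y_n = -Y * sign(mu_n), with y_n = Y when mu_n = 0."""
    xs, ys, mus = [], [], []
    for n in range(1, N + 1):
        x = 2.0 * n
        mu = predict(xs, ys, x)
        y = -Y if mu > 0 else Y
        xs.append(x)
        ys.append(y)
        mus.append(mu)
    return xs, ys, mus

def running_mean(xs, ys, x):
    """A hypothetical on-line predictor, used only as the adversary's opponent."""
    return sum(ys) / len(ys) if ys else 0.0

Y, c, d, N = 1.0, 1.0, 2.0, 100
xs, ys, mus = run_adversary(running_mean, N, Y)

a = c * d / (Y * math.sqrt(N))   # here a = 0.2, so D takes values in [-Y, Y]

def D(x: float) -> float:
    """Benchmark rule D = a * sum_n (y_n / c) f_n with f_n(x) = c h(x - 2n)."""
    return a * sum((y / c) * c * h(x - xn) for xn, y in zip(xs, ys))

loss_alg = sum((y - mu) ** 2 for y, mu in zip(ys, mus))
loss_D = sum((y - D(x)) ** 2 for x, y in zip(xs, ys))
print(loss_alg - loss_D, (2 - a) * Y * c * d * math.sqrt(N))
```

Since the objects $x_n = 2n$ are spaced by 2 and h vanishes outside (-1, 1), each $f_n$ affects only its own round, which is what makes $D(x_n) = a y_n$.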

10 The algorithms 

In this short section we extract the prediction strategies achieving the bounds of 
Theorems 1 and 2 from our proofs of those theorems. Replacing in (19) the kernel 
$\mathbf{k}((\mu, x), (\mu', x'))$ by the merged kernel $\mu\mu'/Y^2 + \mathbf{k}(x, x')$, we obtain 

$$S_n(\mu) = \sum_{i=1}^{n-1} \left( \frac{\mu \mu_i}{Y^2} + \mathbf{k}(x_n, x_i) \right)(y_i - \mu_i) - \left( \frac{\mu^2}{Y^2} + \mathbf{k}(x_n, x_n) \right)\mu; \qquad (42)$$

this immediately leads to the following explicit description for the on-line 
prediction algorithm we used in the proof of Theorem 1. 

An algorithm achieving the bound of Theorem 1 
Parameter: the reproducing kernel $\mathbf{k}$ of $\mathcal{F}$ 
FOR n = 1, 2, ...: 
  Read $x_n \in X$. 
  Define $S_n(\mu)$ by (42) for all $\mu \in [-Y, Y]$. 
  Define $\mu_n$ as any root $\mu \in [-Y, Y]$ of $S_n(\mu) = 0$; 
    if there are no roots, set $\mu_n := Y \,\mathrm{sign}\, S_n$. 
  Read $y_n \in [-Y, Y]$. 
END FOR. 
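
The algorithm above can be sketched as a small on-line predictor. The following is an illustration under stated assumptions rather than a reference implementation: the root finder checks the endpoints before bisecting (a variant that keeps the guarantee $S_n(\mu_n)(y_n - \mu_n) \le 0$), the Gaussian kernel (for which $c_{\mathcal F} = 1$), the benchmark rule and the noisy data are all hypothetical, and the regret bound $2Y\sqrt{c_{\mathcal F}^2 + 1}\,(\|D\|_{\mathcal F} + Y)\sqrt{N}$ is the one obtained from the last chain of Section 6 with $a_1 = 1$ and $a_0 = 1/Y^2$.

```python
import math
import random

def make_predictor(k, Y: float, tol: float = 1e-13):
    """On-line predictor built from (42); call predict(x), then update(y), on each round."""
    xs, ys, mus = [], [], []

    def S(mu, x):
        past = sum((mu * m / Y**2 + k(x, xi)) * (y - m) for xi, y, m in zip(xs, ys, mus))
        return past - (mu * mu / Y**2 + k(x, x)) * mu

    def predict(x):
        if S(Y, x) > 0:
            mu = Y
        elif S(-Y, x) < 0:
            mu = -Y
        else:                              # sign change on [-Y, Y]: bisect for a root
            lo, hi = -Y, Y
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                lo, hi = (mid, hi) if S(mid, x) >= 0 else (lo, mid)
            mu = 0.5 * (lo + hi)
        xs.append(x)
        mus.append(mu)
        return mu

    def update(y):
        ys.append(y)

    return predict, update

rng = random.Random(1)
Y, N = 1.0, 60
k = lambda x, z: math.exp(-(x - z) ** 2)      # illustrative kernel with k(x, x) = 1
D = lambda x: 0.8 * k(x, 0.0)                  # benchmark rule 0.8 * k_0, of norm 0.8
predict, update = make_predictor(k, Y)
loss_alg = loss_D = 0.0
for _ in range(N):
    x = rng.uniform(-2, 2)
    mu = predict(x)
    y = max(-Y, min(Y, D(x) + rng.gauss(0, 0.1)))   # noisy label clipped to [-Y, Y]
    update(y)
    loss_alg += (y - mu) ** 2
    loss_D += (y - D(x)) ** 2
regret_bound = 2 * Y * math.sqrt(1 + 1) * (0.8 + Y) * math.sqrt(N)
print(loss_alg - loss_D, regret_bound)
```

Each prediction costs a number of kernel evaluations proportional to the number of past rounds times the number of bisection steps, so the whole run is quadratic in N.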

To obtain an algorithm achieving the bound of Theorem 2, it suffices to replace (42) by 

n-l 